Statistical Learning: stability is sufficient for generalization
and necessary and sufficient for consistency of Empirical Risk
Minimization

Sayan Mukherjee, Partha Niyogi, Tomaso Poggio^1 and Ryan Rifkin

Center for Biological Computation and Learning, McGovern Institute, Artificial
Intelligence Lab, Center for Genome Research, Whitehead Institute, Brain Sciences
Department, Massachusetts Institute of Technology

Department of Computer Science and Statistics, University of Chicago

Honda Research Institute U.S.A.


The  October  2003  version  replaced  a  previous  revision  (April  2003)  of  a 
rather  flawed  CBCL  Paper  223  and  AI  Memo  2002-024,  Massachusetts 
Institute  of  Technology,  Cambridge,  MA,  December  2002.  This  January  2004 
version  corrects  some  of  the  remaining  typos  and  imprecisions. 


^1 To whom correspondence should be addressed. Email: tp@ai.mit.edu


Report Documentation Page (Standard Form 298). Report date: October 2003. Performing organization: Massachusetts Institute of Technology, Center for Biological and Computational Learning, 77 Massachusetts Avenue, Cambridge, MA 02139. Distribution/availability: approved for public release; distribution unlimited. Security classification (report, abstract, this page): unclassified. Number of pages: 55.


Abstract 


Solutions of learning problems by Empirical Risk Minimization (ERM) -
and almost-ERM when the minimizer does not exist - need to be consistent,
so that they may be predictive. They also need to be well-posed in the
sense of being stable, so that they might be used robustly. We propose a
statistical form of leave-one-out stability, called CVEEE_loo stability. We
have two main new results. We prove that for bounded loss classes CVEEE_loo
stability is (a) sufficient for generalization, that is, convergence in probability
of the empirical error to the expected error, for any algorithm satisfying
it, and (b) necessary and sufficient for generalization and consistency of ERM.
Thus CVEEE_loo stability is a weak form of stability that represents a
sufficient condition for generalization for general learning algorithms while
subsuming the classical conditions for consistency of ERM. We discuss
alternative forms of stability. In particular, we conclude that for ERM a
certain form of well-posedness is equivalent to consistency.
1 Introduction


In learning from a set of examples, the key property of a learning algorithm is
generalization: the empirical error must converge to the expected error when the
number of examples n increases^2. An algorithm that guarantees good
generalization for a given n will predict well, if its empirical error on the training set
is small. Empirical risk minimization (ERM) on a class of functions $\mathcal{H}$, called
the hypothesis space, represents perhaps the most natural class of learning
algorithms: the algorithm selects a function $f \in \mathcal{H}$ that minimizes the empirical
error - as measured on the training set.

Classical learning theory was developed around the study of ERM. One of its
main achievements is a complete characterization of the necessary and
sufficient conditions for generalization of ERM, and for its consistency (consistency
requires convergence of the expected risk to the minimum risk achievable by
functions in $\mathcal{H}$; for ERM generalization is equivalent to consistency [1] and thus
for ERM we will often speak of consistency meaning generalization and
consistency). It turns out that consistency of ERM is equivalent to a precise property
of the hypothesis space: $\mathcal{H}$ has to be a uniform Glivenko-Cantelli (uGC) class of
functions (see later).

Less attention has been given to another requirement on the ERM solution of
the learning problem, which has played an important role in the development
of several learning algorithms but not in learning theory proper. In general,
empirical risk minimization is ill-posed (for any fixed number of training
examples n). Any approach of practical interest needs to ensure well-posedness,
which usually means existence, uniqueness and stability of the solution. The
critical condition is stability of the solution; in this paper we refer to
well-posedness, meaning, in particular, stability. In our case, stability refers to
continuous dependence on the n training data. Stability is equivalent to some
notion of continuity of the learning map (induced by ERM) that maps training
sets into the space of solutions, e.g. $L : Z^n \to \mathcal{H}$.

As a major example, let us consider the following, important case for learning
due to Cucker and Smale [5]. Assume that the hypothesis space $\mathcal{H}$ is a compact
subset of $C(X)$ with X a compact domain in Euclidean space^3. Compactness
ensures^4 the existence of the minimizer of the expected risk for each n and, if
the risk functional is convex^5 and regularity conditions on the measure hold,
its uniqueness [5, 21]. Compactness guarantees continuity of the learning
operator L, measured in the sup norm in $\mathcal{H}$ (see section 2.4.3). However,
compactness is not necessary for well-posedness of ERM (it is well-known, at least since
Tikhonov, that compactness is sufficient but not necessary for well-posedness
of a large class of inverse problems involving linear operators). Interestingly,
compactness is a sufficient^6 but not necessary condition for consistency as well
[5].

^2 The precise notion of generalization defined here roughly agrees with the informal use of the
term in learning theory.

^3 Our concept of generalization, i.e. convergence in probability of the expected error $I[f_S]$ to the
empirical error $I_S[f_S]$, corresponds to the uniform estimate of the "defect" of Theorem B in [5] (in
their setup); consistency of ERM corresponds to their Theorem C; we do not consider in this paper
any result equivalent to their Theorem C*.

^4 Together with continuity and boundedness of the loss function V.

^5 For convex loss function $V(f, z)$.

Thus it is natural to ask the question of whether there is a definition of
well-posedness, and specifically stability - if any - that is sufficient to guarantee
generalization for any algorithm. Since some of the key achievements of
learning theory revolve around the conditions equivalent to consistency of ERM, it
is also natural to ask whether the same notion of stability could subsume the
classical theory of ERM. In other words, is it possible that some specific form
of well-posedness is sufficient for generalization and necessary and sufficient
for generalization and consistency^7 of ERM? Such a result would be surprising
because, a priori, there is no reason why there should be a connection between
well-posedness and generalization - or even consistency (in the case of ERM):
they are both important requirements for learning algorithms but they seem
quite different and independent of each other.

In  this  paper,  we  define  a  notion  of  stability  that  guarantees  generalization  and  in  the 
case  of  ERM  is  in  fact  equivalent  to  consistency. 

Many different notions of stability have been suggested in
the past. The earliest relevant notion may be traced to Tikhonov, where stability
is described in terms of continuity of the learning map L. In learning theory,
Devroye and Wagner [7] use certain notions of algorithmic stability to prove
the consistency of learning algorithms like the k-nearest neighbors classifier.
More recently Kearns and Ron [12] investigated several notions of stability to
develop generalization error bounds in terms of the leave-one-out error.
Bousquet and Elisseeff [4] showed that uniform hypothesis stability of the learning
algorithm may be used to provide exponential bounds on the generalization
error without recourse to notions such as the VC dimension.

These various notions of algorithmic stability are all seen to be sufficient for
(a) the generalization capability (convergence of the empirical to the expected
risk) of learning algorithms. However, until recently, it was unclear whether
there is a notion of stability that (b) is also both necessary and sufficient for
consistency of ERM. The first partial result in this direction was provided by
Kutin and Niyogi [14] who introduced a probabilistic notion of change-one
stability called Cross Validation or CV stability. This was shown to be necessary
and sufficient for consistency of ERM in the Probably Approximately Correct
(PAC) Model of Valiant [24].

However, the task of finding a correct characterization of stability that satisfies
both (a) and (b) above is subtle and non-trivial. In Kutin and Niyogi (2002) [15]
at least ten different notions were examined. An answer for the general setting,
however, was not found.

In this paper we give a new definition of stability - which we call Cross-validation,
error and empirical error Leave-One-Out stability or, in short, CVEEE_loo stability -
of the learning map L. This definition answers the open questions mentioned
above.

^6 Compactness of $\mathcal{H}$ implies the uGC property of $\mathcal{H}$ since it implies finite covering numbers.

^7 In the case of ERM it is well known that generalization is equivalent to consistency.

Thus, our somewhat surprising new result is that this notion of stability is
sufficient for generalization and is both necessary and sufficient for consistency of
ERM. Consistency of ERM is in turn equivalent to $\mathcal{H}$ being a uGC class. To us
the result seems interesting for at least three reasons:

1. it proves the very close relation between two different, and apparently
independent, motivations to the solution of the learning problem:
consistency and well-posedness;

2. it provides a condition - CVEEE_loo stability - that is sufficient for
generalization for any algorithm and for ERM is necessary and sufficient not
only for generalization but also for consistency. CVEEE_loo stability may,
in some ways, be more natural - and perhaps an easier starting point for
empirical work^8 - than classical conditions such as complexity measures
of the hypothesis space $\mathcal{H}$, for example finiteness of $V_\gamma$ or VC dimension;

3. it provides a necessary and sufficient condition for consistency of ERM
that - unlike all classical conditions (see Appendix 6.1) - is a condition on
the mapping induced by ERM and not directly on the hypothesis space
$\mathcal{H}$.

The plan of the paper is as follows. We first give some background and
definitions for the learning problem, ERM, consistency and well-posedness. In
section 3, which is the core of the paper, we define CVEEE_loo stability in terms
of three leave-one-out stability conditions: crossvalidation (CV_loo) stability,
expected error (E_loo) stability and empirical error (EE_loo) stability. Of the three
leave-one-out stability conditions, CV_loo stability is the key one, while the other
two are more technical and satisfied by most reasonable algorithms. We prove
that CVEEE_loo stability is sufficient for generalization for general algorithms.
We then prove in painful detail the sufficiency of CV_loo stability for
consistency of ERM and the necessity of it. Finally we prove that CVEEE_loo stability
is necessary and sufficient for consistency of ERM. We also discuss alternative
definitions of stability. After the main results of the paper we outline in
section 4 stronger stability conditions that imply faster rates of convergence and
are guaranteed only for "small" uGC classes. Examples are hypothesis spaces
with finite VC dimension when the target is in the hypothesis space and balls
in Sobolev spaces or RKHS spaces with a sufficiently high modulus of
smoothness. We then discuss a few remarks and open problems: they include stability
conditions and associated concentration inequalities that are equivalent to uGC
classes of intermediate complexity - between the general uGC classes
characterized by CV_loo stability (with arbitrary rate) and the small classes mentioned
above; they also include the extension of our approach to non-ERM approaches
to the learning problem.

^8 In its distribution-dependent version.

2 Background: learning and ill-posed problems

For notation, definitions and some results, we will assume knowledge of a
foundational paper [5] and other review papers [11, 17]. The results of [4, 14]
are the starting point for our work. Our interest in stability was motivated by
their papers and by our past work in regularization (for reviews see [11, 20]).

2.1  The  supervised  learning  problem 

There is an unknown probability distribution $\mu(x, y)$ on the product space $Z = X \times Y$.
We assume X to be a compact domain in Euclidean space and Y to
be a closed subset of $\mathbb{R}^k$. The measure $\mu$ defines an unknown true function
$T(x) = \int_Y y \, d\mu(y|x)$ mapping X into Y, with $\mu(y|x)$ the conditional probability
measure on Y. There is a hypothesis space $\mathcal{H}$ of functions $f : X \to Y$.

We are given a training set S consisting of n samples (thus $|S| = n$) drawn i.i.d.
from the probability distribution on $Z^n$:

$$S = (x_i, y_i)_{i=1}^n = (z_i)_{i=1}^n.$$

The basic goal of supervised learning is to use the training set S to "learn" a
function $f_S$ (in $\mathcal{H}$) that evaluates at a new value $x_{new}$ and (hopefully) predicts
the associated value of y:

$$y_{pred} = f_S(x_{new}).$$

If y is real-valued, we have regression. If y takes values from $\{-1, 1\}$, we have
binary pattern classification. In this paper we consider only symmetric
learning algorithms, for which the function output does not depend on the ordering
in the training set.

In order to measure goodness of our function, we need a loss function V. We
denote by $V(f, z)$ (where $z = (x, y)$) the price we pay when the prediction for
a given x is $f(x)$ and the true value is y. An example of a loss function is the
square loss which can be written

$$V(f, z) = (f(x) - y)^2.$$

In this paper, we assume that the loss function V is the square loss, though most
results can be extended to many other "good" loss functions. Throughout the
paper we also require that for any $f \in \mathcal{H}$ and $z \in Z$, V is bounded, $0 \le V(f, z) \le M$.

Given a function f, a loss function V, and a probability distribution $\mu$ over X,
we define the generalization error that we call here true error of f as:

$$I[f] = \mathbb{E}_z V(f, z)$$

which is also the expected loss on a new example drawn at random from the
distribution. In the case of square loss

$$I[f] = \mathbb{E}_z V(f, z) = \int_{X,Y} (f(x) - y)^2 \, d\mu(x, y) = \mathbb{E}_\mu |f - y|^2.$$


The basic requirement for any learning algorithm is generalization: the
empirical error must be a good proxy of the expected error, that is the difference
between the two must be "small". Mathematically this means that for the
function $f_S$ selected by the algorithm given a training set S

$$\lim_{n \to \infty} |I[f_S] - I_S[f_S]| = 0 \quad \text{in probability.}$$

An algorithm that guarantees good generalization for a given n will predict
well, if its empirical error on the training set is small.
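
As a minimal numerical sketch of this requirement (an illustration, not part of the original analysis), consider a toy setting that is assumed here: y is uniform on [0, 1], the hypothesis space is the set of constant functions, and ERM with the square loss returns the sample mean. In this assumed setting the expected error of the constant c has the closed form 1/12 + (c - 1/2)^2, so the gap between expected and empirical error can be computed exactly for each sample. The function names are hypothetical.

```python
import random

def erm_constant(ys):
    """ERM over constant functions with the square loss: the sample mean."""
    return sum(ys) / len(ys)

def empirical_error(c, ys):
    """I_S[f] for the constant function f = c under the square loss."""
    return sum((c - y) ** 2 for y in ys) / len(ys)

def expected_error(c):
    """I[f] for f = c when y ~ Uniform[0, 1] (assumed toy distribution):
    Var(y) + (c - E[y])^2 = 1/12 + (c - 1/2)^2."""
    return 1.0 / 12.0 + (c - 0.5) ** 2

def generalization_gap(n, rng):
    """|I[f_S] - I_S[f_S]| for one random training set of size n."""
    ys = [rng.random() for _ in range(n)]
    c = erm_constant(ys)
    return abs(expected_error(c) - empirical_error(c, ys))

rng = random.Random(0)
gaps = {n: generalization_gap(n, rng) for n in (10, 100, 1000, 20000)}
for n, gap in gaps.items():
    print(n, gap)
```

A single draw per sample size only suggests the trend; convergence in probability concerns the distribution of the gap over repeated draws of S.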

In the following we denote by $S^i$ the training set with the point $z_i$ removed and
by $S^{i,z}$ the training set with the point $z_i$ replaced with z. For Empirical Risk
Minimization, the functions $f_S$, $f_{S^i}$, and $f_{S^{i,z}}$ are almost minimizers (see Definition
2.1) of $I_S[f]$, $I_{S^i}[f]$, and $I_{S^{i,z}}[f]$ respectively. As we will see later, this definition
of perturbation of the training set is a natural one in the context of the learning
problem: it is natural to require that the prediction should be asymptotically
robust against deleting a point in the training set. We will also denote by S, z
the training set with the point z added to the n points of S.
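
The three perturbations of the training set described above can be sketched as follows. This is only an illustration: the function names are hypothetical, the training set is represented as a Python list of (x, y) pairs, and indices are 0-based here, whereas the text indexes points from 1.

```python
def remove_point(S, i):
    """S^i: the training set with the point z_i removed."""
    return S[:i] + S[i + 1:]

def replace_point(S, i, z):
    """S^{i,z}: the training set with the point z_i replaced by z."""
    return S[:i] + [z] + S[i + 1:]

def add_point(S, z):
    """S, z: the training set with the point z added to the n points of S."""
    return S + [z]

S = [(0.0, 1.0), (0.5, 0.0), (1.0, 2.0)]
z = (0.25, 3.0)
print(remove_point(S, 1))
print(replace_point(S, 1, z))
print(add_point(S, z))
```

Each helper returns a new list and leaves S unchanged, so the original and perturbed training sets can be passed to the same learning procedure and the resulting functions compared.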

2.2  Empirical  Risk  Minimization 

For generalization, that is for correctly predicting new data, we would like to
select a function f for which $I[f]$ is small, but in general we do not know $\mu$ and
cannot compute $I[f]$.

In the following, we will use the notation $\mathbb{P}_S$ and $\mathbb{E}_S$ to denote respectively
the probability and the expectation with respect to a random draw of the
training set S of size $|S| = n$, drawn i.i.d. from the probability distribution on $Z^n$.
Similar notation using expectations should be self-explanatory.

Given a function f and a training set S consisting of n data points, we can
measure the empirical error (or risk) of f as:

$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i).$$

When the loss function is the square loss

$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 = \mathbb{E}_{\mu_n} (f - y)^2,$$

where $\mu_n$ is the empirical measure supported on the set $x_1, \ldots, x_n$. In this
notation (see for example [17]) $\mu_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$, where $\delta_{x_i}$ is the point evaluation
functional on the set $x_i$.
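
A minimal sketch of the empirical error with the square loss, assuming the training set is a list of (x, y) pairs and f is any Python callable; the names `square_loss` and `empirical_risk` are illustrative.

```python
def square_loss(f, z):
    """V(f, z) = (f(x) - y)^2 for z = (x, y)."""
    x, y = z
    return (f(x) - y) ** 2

def empirical_risk(f, S, loss=square_loss):
    """I_S[f] = (1/n) * sum_i V(f, z_i) over the n points of S."""
    return sum(loss(f, z) for z in S) / len(S)

S = [(0.0, 0.0), (1.0, 2.0)]
identity = lambda x: x
print(empirical_risk(identity, S))  # (0 + 1) / 2 = 0.5
```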

Definition 2.1 Given a training set S and a function space $\mathcal{H}$, we define almost-ERM
(Empirical Risk Minimization) to be a symmetric procedure that selects a function $f_S^{\varepsilon^E}$
that almost minimizes the empirical risk over all functions $f \in \mathcal{H}$, that is for any given
$\varepsilon^E > 0$:

$$I_S[f_S^{\varepsilon^E}] \le \inf_{f \in \mathcal{H}} I_S[f] + \varepsilon^E. \qquad (1)$$

In the following, we will drop the dependence on $\varepsilon^E$ in $f_S^{\varepsilon^E}$. Notice that the
term "Empirical Risk Minimization" (see Vapnik [25]) is somewhat misleading:
in general the minimum need not exist^9. In fact, it is precisely for this reason^10
that we use the notion of almost minimizer or $\varepsilon$-minimizer, given in Equation
(1) (following others e.g. [1, 17]), since the infimum of the empirical risk always
exists. In this paper, we use the term ERM to refer to almost-ERM, unless we
say otherwise.
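
The notion of $\varepsilon$-minimizer can be illustrated on a finite hypothesis space, where the infimum is attained and easy to compute. The sketch below is a toy built on assumptions: the hypotheses are constant functions on a grid, `eps` plays the role of the tolerance in Equation (1), and the procedure returns the first hypothesis on the grid whose empirical risk is within `eps` of the infimum. Any other rule for choosing among the almost minimizers that does not depend on the ordering of S would serve equally well.

```python
def empirical_risk(c, S):
    """I_S[f] for the constant function f = c under the square loss."""
    return sum((c - y) ** 2 for _, y in S) / len(S)

def almost_erm(S, hypotheses, eps):
    """Return (hypothesis, its risk, infimum risk) for a hypothesis whose
    empirical risk is within eps of the infimum over the finite space."""
    risks = [empirical_risk(c, S) for c in hypotheses]
    inf_risk = min(risks)
    for c, r in zip(hypotheses, risks):
        if r <= inf_risk + eps:
            return c, r, inf_risk

S = [(0.0, 0.1), (0.3, 0.4), (0.7, 0.5), (1.0, 0.9)]
grid = [k / 10 for k in range(11)]  # constants 0.0, 0.1, ..., 1.0
c, risk, inf_risk = almost_erm(S, grid, eps=0.01)
print(c, risk, inf_risk)
```

With `eps=0` the procedure reduces to exact ERM on the grid, which is possible here only because the assumed hypothesis space is finite.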

We will use the following notation for the loss class $\mathcal{L}$ of functions induced by V
and $\mathcal{H}$. For every $f \in \mathcal{H}$, let $\ell(z) = V(f, z)$, where z corresponds to x, y. Thus
$\ell(z) : X \times Y \to \mathbb{R}$ and we define $\mathcal{L} = \{\ell(f) : f \in \mathcal{H}, V\}$. The use of the notation
$\ell$ emphasizes that the loss function $\ell$ is a new function of z induced by f (with
the measure $\mu$ on $X \times Y$).

2.3  Consistency  of  ERM  and  uGC  classes 

The key problem of learning theory was posed by Vapnik as the problem of
statistical consistency of ERM and of the necessary and sufficient conditions to
guarantee it. In other words, how can we guarantee that the empirical
minimizer of $I_S[f]$ - the distance in the empirical norm between f and y - will yield
a small $I[f]$? It is well known (see [1]) that convergence of the empirical error
to the expected error guarantees for ERM its consistency.

Our definition of consistency^11 is:

Definition 2.2 A learning map is (universally, weakly) consistent if for any given
$\varepsilon_c > 0$

$$\lim_{n \to \infty} \sup_{\mu} \mathbb{P} \left\{ I[f_S] > \inf_{f \in \mathcal{H}} I[f] + \varepsilon_c \right\} = 0. \qquad (2)$$

Universal consistency means that the above definition holds with respect to
the set of all measures on Z. Consistency can be defined with respect to a
specific measure on Z. Weak consistency requires only convergence in probability,
strong consistency requires almost sure convergence. For bounded loss
functions weak consistency and strong consistency are equivalent. In this paper we
call consistency what is sometimes defined as weak, universal consistency [6].

^9 When $\mathcal{H}$ is the space of indicator functions, minimizers of the empirical risk exist, because
either a point $x_i$ is classified as an error or not.

^10 It is worth emphasizing that $\varepsilon$-minimization is not assumed to take care of algorithm
complexity issues (or related numerical precision issues) that are outside the scope of this paper.

^11 The definition is different from the one given by [6] and should be called universal, uniform
weak consistency.

The work of Vapnik and Dudley showed that consistency of ERM can be
ensured by restricting sufficiently the hypothesis space $\mathcal{H}$ to ensure that a
function that is close to a target T for an empirical measure will also be close with
respect to the original measure. The key condition for consistency of ERM can
be formalized in terms of uniform convergence in probability of the functions $\ell(z)$
induced by $\mathcal{H}$ and V. Function classes for which there is uniform convergence
in probability are called uniform Glivenko-Cantelli classes of functions:

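Definition 2.3 Let $\mathcal{F}$ be a class of functions. $\mathcal{F}$ is a (weak) uniform Glivenko-Cantelli
class if

$$\forall \varepsilon > 0 \quad \lim_{n \to \infty} \sup_{\mu} \mathbb{P} \left\{ \sup_{f \in \mathcal{F}} |\mathbb{E}_{\mu_n} f - \mathbb{E}_{\mu} f| > \varepsilon \right\} = 0. \qquad (3)$$

There may be measurability issues that can be handled by imposing mild
conditions on $\mathcal{F}$ (see [8, 9]).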
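
For intuition, the inner supremum in Equation (3) can be evaluated exactly for one simple class under one fixed measure. The sketch below assumes the class of threshold indicators $f_t(x) = 1[x \le t]$ with t in [0, 1], and x uniform on [0, 1], so that $\mathbb{E}_\mu f_t = t$ and the supremum over the class is the Kolmogorov-Smirnov statistic of the sample. It examines a single distribution only, whereas the definition takes a supremum over all measures; the function name is hypothetical.

```python
import random

def sup_deviation(xs):
    """sup_t |E_{mu_n} f_t - E_mu f_t| for f_t(x) = 1[x <= t] and
    x ~ Uniform[0, 1], computed from the sorted sample."""
    xs = sorted(xs)
    n = len(xs)
    dev = 0.0
    for i, x in enumerate(xs, start=1):
        # The empirical mean of f_t jumps from (i - 1)/n to i/n at t = x.
        dev = max(dev, i / n - x, x - (i - 1) / n)
    return dev

rng = random.Random(0)
devs = {n: sup_deviation([rng.random() for _ in range(n)])
        for n in (10, 100, 1000, 10000)}
for n, d in devs.items():
    print(n, d)
```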

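When applied to the loss functions $\ell$, the definition implies that for all
distributions $\mu$ and for each $\varepsilon_n$ there exists a $\delta_{\varepsilon_n, n}$ such that

$$\mathbb{P} \left\{ \sup_{\ell \in \mathcal{L}} |I[\ell] - I_S[\ell]| > \varepsilon_n \right\} \le \delta_{\varepsilon_n, n},$$

where the sequences $\varepsilon_n$ and $\delta_{\varepsilon_n, n}$ go simultaneously to zero^12. Later in the
proofs we will take the sequence of $\varepsilon_n^E$ (in the definition of $\varepsilon$-minimizer) to 0
with a rate faster than $1/n$, therefore faster than the sequence of $\varepsilon_n$ (e.g. the $\varepsilon_n$ in
the uGC definition).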

We are now ready to state the "classical" necessary and sufficient condition for
consistency of ERM (from Alon et al., Theorem 4.2, part 3 [1], see also [25, 9]).

Theorem 2.1 Assuming that the loss functions $\ell \in \mathcal{L}$ are bounded and the collection
of functions $\{\ell - \inf_{\mathcal{L}} \ell : \ell \in \mathcal{L}\}$ are uniformly bounded^13, a necessary and sufficient
condition for consistency of ERM (with respect to all measures) is that $\mathcal{L}$ is uGC.

We observe that for many "good" loss functions V - in particular the square
loss - with $\ell$ bounded, the uGC property of $\mathcal{H}$ is equivalent to the uGC
property of $\mathcal{L}$^14.

Notice that there is a definition of strong uGC classes where, instead of
convergence in probability, almost sure convergence is required.

^12 This fact follows from the metrization of the convergence of random variables in probability
by the Ky Fan metric and its analogue for convergence in outer probability. The rate can be slow
in general (Dudley, pers. com.).

^13 These conditions will be satisfied for bounded loss functions $0 \le \ell(z) \le M$.

^14 Assume that the loss class has the following Lipschitz property for all $x \in X$, $y \in Y$, and
$f_1, f_2 \in \mathcal{H}$:

$$c_1 |V(f_1(x), y) - V(f_2(x), y)| \le |f_1(x) - f_2(x)| \le c_2 |V(f_1(x), y) - V(f_2(x), y)|,$$

where $0 < c_1 < c_2$ are Lipschitz constants that upper and lower-bound the functional difference.
Then $\mathcal{L}$ is uGC iff $\mathcal{H}$ is uGC because there are Lipschitz constants that upper and lower bound the
difference between two functions ensuring that the cardinality of $\mathcal{H}$ and $\mathcal{L}$ at a scale $\varepsilon$ differ by at
most a constant. Bounded $L_p$ losses have this property for $1 < p < \infty$.

Definition 2.4 Let $\mathcal{F}$ be a class of functions. $\mathcal{F}$ is a strong uniform Glivenko-Cantelli
class if

$$\forall \varepsilon > 0 \quad \lim_{n \to \infty} \sup_{\mu} \mathbb{P} \left\{ \sup_{m \ge n} \sup_{f \in \mathcal{F}} |\mathbb{E}_{\mu_m} f - \mathbb{E}_{\mu} f| > \varepsilon \right\} = 0. \qquad (4)$$
For bounded loss functions weak uGC is equivalent to strong uGC (see
Theorem 6 in [9]) and weak consistency is equivalent to strong consistency in
Theorem 2.1. In the following, we will speak simply of uGC and consistency,
meaning - strictly speaking - weak uGC and weak consistency.

2.4  Inverse  and  Well-posed  problems 

2.4.1  The  classical  case 

Hadamard introduced the definition of ill-posedness. Ill-posed problems are
often inverse problems.

As an example, assume g is an element of Z and u is a function in $\mathcal{H}$, with Z
and $\mathcal{H}$ metric spaces. Then given the operator A, consider the equation

$$g = Au. \qquad (5)$$

The direct problem is to compute g given u; the inverse problem is to compute
u given the data g. The inverse problem of finding u is well-posed when

• the solution exists,

• is unique and

• is stable, that is depends continuously on the initial data g. In the example
above this means that $A^{-1}$ has to be continuous. Thus stability has to be
defined in terms of the relevant norms.

Ill-posed problems (see [10]) fail to satisfy one or more of these criteria. In the
literature the term ill-posed is often used for problems that are not stable, which
is the key condition. In Equation (5) the map $A^{-1}$ is continuous on its domain
Z if, given any $\varepsilon > 0$, there is a $\delta > 0$ such that for any $z', z'' \in Z$ with

$$\|z' - z''\| \le \delta$$

with the norm in Z, then

$$\|A^{-1} z' - A^{-1} z''\| \le \varepsilon,$$

with the norm in $\mathcal{H}$.

The basic idea of regularization for solving ill-posed problems is to restore
existence, uniqueness and stability of the solution by an appropriate choice of
$\mathcal{H}$ (the hypothesis space in the learning framework). Usually, existence can be
ensured by redefining the problem and uniqueness can often be restored in
simple ways (for instance in the learning problem we choose randomly one of
the several equivalent almost minimizers). However, stability of the solution is
usually much more difficult to guarantee. The regularization approach has its
origin in a topological lemma^15 that under certain conditions points to the
compactness of $\mathcal{H}$ as sufficient for establishing stability and thus well-posedness^16.
Notice that when the solution of Equation (5) does not exist, the standard
approach is to replace it with the following problem, analogous to ERM,

$$\min_{u \in \mathcal{H}} \|Au - g\| \qquad (6)$$

where the norm is in Z. Assuming for example that Z and $\mathcal{H}$ are Hilbert spaces
and A is linear and continuous, the solutions of Equation (6) coincide with the
solutions of

$$Au = Pg \qquad (7)$$

where P is the projection onto $R(A)$.
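
In finite dimensions the equivalence between Equation (6) and Equation (7) can be checked numerically. The sketch below assumes NumPy is available and uses a random 5 x 3 matrix as the operator A; the projection onto the range of A is formed as $A A^{+}$, with $A^{+}$ the Moore-Penrose pseudoinverse. These choices are made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # linear operator from R^3 to R^5
g = rng.standard_normal(5)        # data, generally not in the range of A

# Least-squares solution of min_u ||A u - g||  (Equation (6)).
u, *_ = np.linalg.lstsq(A, g, rcond=None)

# Orthogonal projection onto the range of A.
P = A @ np.linalg.pinv(A)

# The least-squares solution solves A u = P g  (Equation (7)).
residual = np.linalg.norm(A @ u - P @ g)
print(residual)
```

The residual is zero up to floating-point error, while `A @ u - g` is in general not zero: the least-squares problem replaces the data g by its projection onto what the operator can reach.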


2.4.2  Classical  framework:  regularization  of  the  learning  problem 

For the learning problem it is clear, but often neglected, that ERM is in general
ill-posed for any given $S_n$. ERM defines a map L which maps any discrete data
$S = ((x_1, y_1), \ldots, (x_n, y_n))$ into a function f, that is

$$LS = f_S.$$

In Equation (5) L corresponds to $A^{-1}$ and g to the discrete data S. In general,
the operator L induced by ERM cannot be expected to be linear. In the rest
of this subsection, we consider a simple, "classical" case that corresponds to
Equation (7) and in which L is linear.

Assume that the x part of the n examples $(x_1, \ldots, x_n)$ is fixed; then L as an
operator on $(y_1, \ldots, y_n)$ can be defined in terms of a set of evaluation functionals
$F_i$ on $\mathcal{H}$, that is $y_i = F_i(u)$. If $\mathcal{H}$ is a Hilbert space and in it the evaluation
functionals $F_i$ are linear and bounded, then $\mathcal{H}$ is a RKHS and the $F_i$ can be written
as $F_i(u) = \langle u, K_{x_i} \rangle_K$, where K is the kernel associated with the RKHS and we
use the inner product in the RKHS. For simplicity we assume that K is positive
definite and sufficiently smooth [5, 26]. The ERM case corresponds to Equation
(6), that is

$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2. \qquad (8)$$

Compactness  is  ensured  by  enforcing  the  solution  /  -  which  has  the  form 
/(^)  =  CiAr(xi,x)  since  it  belongs  to  the  RKHS  -  to  be  in  the  ball  Br 

'^Lemma  (Tikhonov,  [23])  If  operator  A  maps  a  compact  set  Ti.  C  H  onto  Z  C  Q,  H  and  Q  metric 
spaces,  and  A  is  continuous  and  one-to-one,  then  the  inverse  mapping  is  also  continuous. 

learning,  the  approach  underlying  most  algorithms  such  as  RBF  and  SVMs  is  in  fact  regular¬ 
ization.  These  algorithms  can  therefore  be  directly  motivated  in  terms  of  restoring  well-posedness 
of  the  learning  problem. 
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of  radius  i?  in  (eg  \\f\\K  <  R)-  Then  H  =  Ik{Br)  is  compact  -  where 
Ik  ■  Hr  ^  C'{X)  is  the  inclusion  and  C{X)  is  the  space  of  continuous  func¬ 
tions  with  the  sup  norm  [5].  In  this  case  the  minimizer  of  the  generalization  er¬ 
ror  /[/]  is  well-posed.  Minimization  of  the  empirical  risk  (Equation  (8))  is  also 
well-posed:  it  provides  a  set  of  linear  equations  to  compute  the  coefficients  c 
of  the  solution  /  as 

Kc  =  y  (9) 

wherey  =  (yi, and  {K)ij  =  K{xi,Xj). 

A particular form of regularization, called Tikhonov regularization, replaces ERM (see Equation (8)) with

\[ \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \gamma \|f\|_K^2, \qquad (10) \]

which gives the following set of equations for $c$ (with $\gamma \ge 0$)

\[ (K + n\gamma I)\, c = y, \qquad (11) \]


which for $\gamma = 0$ reduces to Equation (9). In this RKHS case, stability of the empirical risk minimizer provided by Equation (10) can be characterized using the classical notion of condition number of the problem. The change in the solution $f$ due to a variation in the data $y$ can be bounded as

\[ \frac{\|\Delta f\|}{\|f\|} \le \|K + n\gamma I\| \, \|(K + n\gamma I)^{-1}\| \, \frac{\|\Delta y\|}{\|y\|}, \qquad (12) \]

where the condition number $\|K + n\gamma I\| \, \|(K + n\gamma I)^{-1}\|$ is controlled by $n\gamma$. A large value of $n\gamma$ gives condition numbers close to 1, whereas ill-conditioning may result if $\gamma = 0$ and the ratio of the largest to the smallest eigenvalue of $K$ is large.
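The effect of $n\gamma$ on the condition number in Equation (12) can be illustrated on the smallest non-trivial case. The sketch below assumes two training points and a Gaussian kernel with $K(x, x) = 1$ (both are illustrative choices, not taken from the text), so that $K = [[1, k], [k, 1]]$ has eigenvalues $1 + k$ and $1 - k$, and the 2-norm condition number of $K + n\gamma I$ is the ratio of its largest to its smallest eigenvalue. The function names are hypothetical.

```python
import math

def gaussian_kernel(x, x2, width=1.0):
    """Gaussian kernel; an assumed example of a positive definite kernel."""
    return math.exp(-((x - x2) ** 2) / (2.0 * width ** 2))

def condition_number_2x2(x1, x2, n_gamma):
    """2-norm condition number of K + n*gamma*I for two training points.

    K = [[1, k], [k, 1]] has eigenvalues 1 + k and 1 - k (0 < k < 1),
    so adding n*gamma*I shifts both eigenvalues by n_gamma.
    """
    k = gaussian_kernel(x1, x2)
    lam_max = 1.0 + k + n_gamma
    lam_min = 1.0 - k + n_gamma
    return lam_max / lam_min

# Two nearly coincident points make K almost singular when gamma = 0.
conds = {ng: condition_number_2x2(0.0, 0.01, ng) for ng in (0.0, 0.1, 1.0, 100.0)}
for ng, c in conds.items():
    print("n*gamma =", ng, "condition number =", c)
```

With nearly coincident points the unregularized system is badly conditioned, and the condition number decreases monotonically towards 1 as $n\gamma$ grows, in line with the discussion above.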


Remarks: 

1. Equation (8) for any fixed $n$ corresponds to the set of well-posed, linear equations (9), even without the constraint $\|f\|_K^2 \le R$: if $K$ is symmetric and positive definite and the $x_i$ are distinct then $K^{-1}$ exists and $\|f\|_K^2$ is automatically bounded (with a bound that increases with $n$). For any fixed $n$, the condition number is finite but typically increases with $n$.

2. Minimization of the functional in Equation (10) with $\gamma > 0$ implicitly enforces the solution to be in a ball in the RKHS, whose radius can be bounded "a priori" before the data set $S$ is known (see [18]).

2.4.3  Stability of learning: a more general case


The approach to defining stability described above for the RKHS case cannot be used directly in the more general setup of the supervised learning problem introduced in section 2.1. In particular, the training set $S_n$ is drawn i.i.d. from the probability distribution on $Z$, the $x_i$ are not fixed and we may not even have a norm in $\mathcal{H}$ (in the case of RKHS the norm in $\mathcal{H}$ bounds the sup norm).

The  probabilistic  case  for  H  with  the  sup  norm 

A (distribution-dependent) definition of stability that takes care of some of the issues above was introduced by [4] with the name of uniform stability:

\[ \forall S \in Z^n, \ \forall i \in \{1, \ldots, n\} \quad \sup_{z \in Z} |V(f_S, z) - V(f_{S^i}, z)| \le \beta. \qquad (13) \]

Kutin and Niyogi [14] showed that ERM does not in general exhibit Bousquet and Elisseeff's definition of uniform stability. Therefore they extended it in a probabilistic sense with the name of $(\beta, \delta)$-hypothesis stability, which is a natural stability criterion for hypothesis spaces equipped with the sup norm. We give here a slightly different version:

\[ \mathbb{P}_S \left\{ \sup_{z \in Z} |V(f_S, z) - V(f_{S^i}, z)| \le \beta \right\} \ge 1 - \delta, \qquad (14) \]

where $\beta$ and $\delta$ go to zero with $n \to \infty$.

Interestingly, the results of [4] imply that Tikhonov regularization algorithms are uniformly stable (and of course $(\beta, \delta)$-hypothesis stable) with $\beta = O(\frac{1}{\gamma n})$. Thus, this definition of stability recovers the key parameters for a good condition number of the regularization algorithms. As discussed later, we conjecture that in the case of ERM, $(\beta, \delta)$-hypothesis stability is related to the compactness of $\mathcal{H}$ with respect to the sup norm in $C(X)$.

A  more  general  definition  of  stability 

The above definitions of stability are not appropriate for hypothesis spaces for which the sup norm is not meaningful, at least in the context of the learning problem (for instance, for hypothesis spaces of indicator functions). In addition, it is interesting to note that the definitions of stability introduced above - and in the past - are not general enough to be equivalent to the classical necessary and sufficient conditions on $\mathcal{H}$ for consistency of ERM. The key ingredient in our definitions of stability given above is some measure on $|V(f_S, z) - V(f_{S^i}, z)|$, e.g. a measure of the difference between the error made by the predictor obtained by using ERM on the training set $S$ vs. the error of the predictor obtained from a slightly perturbed training set $S^i$. Since we want to deal with spaces without a topology, we propose here the following definition (given here in its distribution-dependent form) of leave-one-out cross-validation (in short, CVloo) stability, which is the key part in the notion of CVEEEloo stability introduced later:

\[ \forall i \in \{1, \ldots, n\} \quad \mathbb{P}_S \{ |V(f_S, z_i) - V(f_{S^i}, z_i)| \le \beta_{CV} \} \ge 1 - \delta_{CV}. \]

Here we measure the difference between the errors at a point $z_i$ which is in the training set of one of the predictors but not in the training set of the other. Notice that the definitions of stability we discussed here are progressively weaker: a good condition number (for increasing $n$) implies good uniform stability (a footnote below notes that $n\gamma$ controls both). In turn, uniform stability implies $(\beta, \delta)$-hypothesis stability which implies CVloo stability. For the case of supervised learning all the definitions capture the basic idea of stability of a well-posed problem: the function "learned" from a training set should, with high probability, change little in its pointwise predictions for a small change in the training set, such as deletion of one of the examples.
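The quantity $|V(f_S, z_i) - V(f_{S^i}, z_i)|$ that appears in the CVloo definition can be computed directly for a toy learning map. The sketch below is an illustration under stated assumptions, not an algorithm from the text: the hypothesis space is the set of constant functions, the loss is the square loss, and the outputs $y$ lie in $[0, 1]$, so that ERM returns the sample mean. The function names are hypothetical. For this particular map, removing one point moves the mean by at most $1/n$, and the gap is therefore at most $2/n$.

```python
import random

def erm_constant(ys):
    """ERM over constant functions with the square loss: the sample mean."""
    return sum(ys) / len(ys)

def square_loss(c, y):
    return (c - y) ** 2

def cv_loo_gaps(ys):
    """Return |V(f_S, z_i) - V(f_{S^i}, z_i)| for every left-out index i."""
    f_s = erm_constant(ys)
    gaps = []
    for i, y_i in enumerate(ys):
        f_si = erm_constant(ys[:i] + ys[i + 1:])  # trained without z_i
        gaps.append(abs(square_loss(f_s, y_i) - square_loss(f_si, y_i)))
    return gaps

random.seed(0)
for n in (10, 100, 1000):
    ys = [random.random() for _ in range(n)]
    # For outputs in [0, 1] the largest gap is bounded by 2 / n.
    print(n, max(cv_loo_gaps(ys)), 2.0 / n)
```

In this example the gap is bounded for every sample, which is stronger than the probabilistic statement in the definition; the definition only asks that the gap be small with probability at least $1 - \delta_{CV}$.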


Remarks: 

1. In the learning problem, uniqueness of the solution of ERM is always meant in terms of uniqueness of $\ell$ and therefore uniqueness of the equivalence class induced in $\mathcal{H}$ by the loss function $V$. In other words, multiple $f \in \mathcal{H}$ may provide the same $\ell$. Even in this sense, ERM on a uGC class is not guaranteed to provide a unique "almost minimizer". Uniqueness of an almost minimizer therefore is a rather weak concept since uniqueness is valid modulo the equivalence classes induced by the loss function and by $\varepsilon$-minimization.

2. Stability of algorithms is almost always violated, even in good and useful algorithms (Smale, pers. comm.). In this paper, we are not concerned about stability of algorithms but stability of problems. Our notions of stability of the map $L$ are in the same spirit as the condition number of a linear problem, which is independent of the algorithm to be used to solve it. As we discussed earlier, both CVloo stability and uniform stability can be regarded as extensions of the notion of condition number (for a discussion in the context of inverse ill-posed problems see [2]).


3  Stability,  generalization  and  consistency  of  ERM 

3.1  Probabilistic  preliminaries 

The following are easy consequences of our definitions and will be used without further mention throughout the paper:

\[ \mathbb{E}_S[I[f_S]] = \mathbb{E}_{S,z}[V(f_S, z)] \]

\[ \forall i \in \{1, \ldots, n\} \quad \mathbb{E}_S[I_S[f_S]] = \mathbb{E}_S[V(f_S, z_i)] \]

\[ \forall i \in \{1, \ldots, n\} \quad \mathbb{E}_S[I[f_{S^i}]] = \mathbb{E}_S[V(f_{S^i}, z_i)] \]

Footnote: Note that $n\gamma$, which controls the quality of the condition number in regularization, also controls the rate of uniform stability.


3.2  Leave-one-out  stability  properties 

This section introduces several definitions of stability - all in the leave-one-out form - and shows the equivalence of two of them. The first definition of stability of the learning map $L$ is Cross-Validation Leave-One-Out stability, which turns out to be the central one for most of our results. This notion of stability is based upon a variation of a definition of stability introduced in [14].


3.2.1  CVloo stability


Definition 3.1 The learning map $L$ is distribution-independent, CVloo stable if for each $n$ there exists a $\beta_{CV}^{(n)}$ and a $\delta_{CV}^{(n)}$ such that

\[ \forall i \in \{1, \ldots, n\} \ \forall \mu \quad \mathbb{P}_S \{ |V(f_{S^i}, z_i) - V(f_S, z_i)| \le \beta_{CV}^{(n)} \} \ge 1 - \delta_{CV}^{(n)}, \]

with $\beta_{CV}^{(n)}$ and $\delta_{CV}^{(n)}$ going to zero for $n \to \infty$.


Notice that our definition of the stability of $L$ depends on the pointwise value of $|V(f_S, z_i) - V(f_{S^i}, z_i)|$. This definition is much weaker than the uniform stability condition [4] and is implied by it.

A definition which turns out to be equivalent was introduced by [4] (see also [12]) under the name of pointwise hypothesis stability or PH stability, which we give here in its distribution-free version:

Definition 3.2 The learning map $L$ has distribution-independent, pointwise hypothesis stability if for each $n$ there exists a $\beta_{PH}^{(n)}$ such that

\[ \forall i \in \{1, \ldots, n\} \ \forall \mu \quad \mathbb{E}_S[|V(f_S, z_i) - V(f_{S^i}, z_i)|] \le \beta_{PH}^{(n)}, \]

with $\beta_{PH}^{(n)}$ going to zero for $n \to \infty$.

We now show that the two definitions of CVloo stability and PH stability are equivalent (in general, without assuming ERM).

Lemma 3.1 CVloo stability with $\beta_{loo}$ and $\delta_{loo}$ implies PH stability with $\beta_{PH} = \beta_{loo} + M\delta_{loo}$; PH stability with $\beta_{PH}$ implies CVloo stability with $(\alpha, \beta_{PH}/\alpha)$ for any $\alpha < \beta_{PH}$.

Proof: We give the proof for a given distribution. The result extends trivially to the distribution-free ($\forall \mu$) case. From the definition of CVloo stability and the bound on the loss function it follows

\[ \forall i \in \{1, \ldots, n\} \quad \mathbb{E}_S[|V(f_S, z_i) - V(f_{S^i}, z_i)|] \le \beta_{loo} + M\delta_{loo}. \]

This proves CVloo stability implies PH stability.

From the definition of PH stability, we have

\[ \mathbb{E}_S[|V(f_{S^i}, z_i) - V(f_S, z_i)|] \le \beta_{PH}. \]

Since $|V(f_{S^i}, z_i) - V(f_S, z_i)| \ge 0$, by Markov's inequality, we have

\[ \mathbb{P}[|V(f_{S^i}, z_i) - V(f_S, z_i)| > \alpha] \le \frac{\mathbb{E}_S[|V(f_{S^i}, z_i) - V(f_S, z_i)|]}{\alpha} \le \frac{\beta_{PH}}{\alpha}. \]

From this it is immediate that the learning algorithm is $(\alpha, \beta_{PH}/\alpha)$ CVloo stable.

□
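The first direction of the proof of Lemma 3.1 compresses one step: passing from a high-probability bound to a bound in expectation. A worked version is sketched below, under the standing assumption that the loss is bounded, $0 \le V \le M$. The symbols $\Delta_i$ and $G$ are introduced here for illustration only and do not appear in the original.

```latex
% Notation (introduced for this sketch):
%   \Delta_i = |V(f_{S^i}, z_i) - V(f_S, z_i)|, so that 0 \le \Delta_i \le M,
%   G = \{ S : \Delta_i \le \beta_{loo} \}, so that P_S(G^c) \le \delta_{loo}.
\begin{align*}
\mathbb{E}_S[\Delta_i]
  &= \mathbb{E}_S[\Delta_i \mathbf{1}_{G}] + \mathbb{E}_S[\Delta_i \mathbf{1}_{G^c}] \\
  &\le \beta_{loo}\, \mathbb{P}_S(G) + M\, \mathbb{P}_S(G^c) \\
  &\le \beta_{loo} + M \delta_{loo} \;=\; \beta_{PH}.
\end{align*}
% Second direction: Markov's inequality holds for every \alpha > 0,
%   P_S(\Delta_i > \alpha) \le \beta_{PH} / \alpha .
% As an illustrative choice (not stated in the source), \alpha = \beta_{PH}^{1/2}
% gives the pair (\beta_{PH}^{1/2}, \beta_{PH}^{1/2}), and both entries
% go to zero whenever \beta_{PH} does.
```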


3.2.2  Eloo and EEloo stability

We now introduce two rather weak and natural stability conditions which, unlike CVloo stability, are not pointwise and should be satisfied by most reasonable algorithms. The first one - introduced and used by [12] - is leave-one-out stability of the expected error; the second one - which is similar to the overlap stability of Kutin and Niyogi - is leave-one-out stability of the empirical error. They are both very reasonable from the point of view of well-posedness of the problem. In particular, for ERM it would be indeed inconsistent if stability of the expected and empirical error were not true. Notice that uniform stability, which is the "normal" definition of continuity for stability (stability in the $L_\infty$ norm), implies both Eloo and EEloo stability. In fact $(\beta, \delta)$-hypothesis stability immediately gives CVloo stability, hypothesis stability, Eloo stability, and EEloo stability.

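Definition 3.3 The learning map $L$ is distribution-independent, Error stable - in short Eloo stable - if for each $n$ there exists a $\beta_{Er}^{(n)}$ and a $\delta_{Er}^{(n)}$ such that for all $i = 1 \ldots n$

\[ \forall \mu \quad \mathbb{P}_S \{ |I[f_{S^i}] - I[f_S]| \le \beta_{Er}^{(n)} \} \ge 1 - \delta_{Er}^{(n)}, \]

with $\beta_{Er}^{(n)}$ and $\delta_{Er}^{(n)}$ going to zero for $n \to \infty$.


Definition 3.4 The learning map $L$ is distribution-independent, Empirical error stable - in short EEloo stable - if for each $n$ there exists a $\beta_{EE}^{(n)}$ and a $\delta_{EE}^{(n)}$ such that for all $i = 1 \ldots n$

\[ \forall \mu \quad \mathbb{P}_S \{ |I_{S^i}[f_{S^i}] - I_S[f_S]| \le \beta_{EE}^{(n)} \} \ge 1 - \delta_{EE}^{(n)}, \]

with $\beta_{EE}^{(n)}$ and $\delta_{EE}^{(n)}$ going to zero for $n \to \infty$.

Since the loss function is bounded by $M$ an equivalent definition of EEloo stability is: for each $n$ there exists a $\beta_{EE}^{(n)}$ and a $\delta_{EE}^{(n)}$ such that for all $i = 1 \ldots n$

\[ \forall \mu \quad \mathbb{P}_S \{ |I_S[f_{S^i}] - I_S[f_S]| \le \beta_{EE}^{(n)} \} \ge 1 - \delta_{EE}^{(n)}, \]

with $\beta_{EE}^{(n)}$ and $\delta_{EE}^{(n)}$ going to zero for $n \to \infty$.

The $\beta_{EE}^{(n)}$ in the two variants of the definition are within $\frac{M}{n}$ of each other.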


3.2.3  CVEEEloo stability

As we will show, the combination of CVloo, Eloo and EEloo stability is sufficient for generalization for generic, symmetric algorithms and is necessary and sufficient for consistency of ERM. The following definition will be useful.

Definition 3.5 When a learning map $L$ exhibits CVloo, Eloo and EEloo stability, we will say that it has CVEEEloo stability.

Notice that uniform stability implies CVEEEloo stability but is not implied by it.

3.3  CVEEEloo stability implies generalization

In this section we prove that CVEEEloo stability is sufficient for generalization for general learning algorithms.

We first prove the following useful Lemma.

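Lemma 3.2 Given the following expectation

\[ \mathbb{E}_x[A_x(B_x - C_x)], \]

with the random variables $0 \le A_x, B_x, C_x \le M$ and random variables $0 \le A'_x, B'_x, C'_x \le M$ where

\[ \mathbb{P}_x(|A'_x - A_x| > \beta_A) \le \delta_A \]
\[ \mathbb{P}_x(|B'_x - B_x| > \beta_B) \le \delta_B \]
\[ \mathbb{P}_x(|C'_x - C_x| > \beta_C) \le \delta_C, \]

then

\[ \mathbb{E}_x[A_x(B_x - C_x)] \le \mathbb{E}_x[A'_x(B'_x - C'_x)] + 3M\beta_A + 3M^2\delta_A + M\beta_B + M^2\delta_B + M\beta_C + M^2\delta_C. \]

Proof:

We first define the following three terms

\[ \Delta_{A,x} = A_x - A'_x \]
\[ \Delta_{B,x} = B_x - B'_x \]
\[ \Delta_{C,x} = C'_x - C_x. \]

We now rewrite the expectation and use the fact that the random variables are nonnegative and bounded by $M$

\[ \mathbb{E}_x[A_x(B_x - C_x)] = \mathbb{E}_x[(A'_x + \Delta_{A,x})(B'_x + \Delta_{B,x} - C'_x + \Delta_{C,x})] \]
\[ \le \mathbb{E}_x[A'_x(B'_x - C'_x)] + 3M\,\mathbb{E}_x|\Delta_{A,x}| + M\,\mathbb{E}_x|\Delta_{B,x}| + M\,\mathbb{E}_x|\Delta_{C,x}|. \]

By the assumptions given

\[ |\Delta_{A,x}| \le \beta_A \text{ with probability } 1 - \delta_A \]
\[ |\Delta_{B,x}| \le \beta_B \text{ with probability } 1 - \delta_B \]
\[ |\Delta_{C,x}| \le \beta_C \text{ with probability } 1 - \delta_C. \]

Sets of $x$ for which $|\Delta_{A,x}| \le \beta_A$ are called $G$ (the fraction of sets for which this holds is at least $1 - \delta_A$) while the complement is called $G^c$

\[ \mathbb{E}_x|\Delta_{A,x}| = \mathbb{E}_{x \in G}|\Delta_{A,x}| + \mathbb{E}_{x \in G^c}|\Delta_{A,x}| \]
\[ \le (1 - \delta_A)\beta_A + M\delta_A \]
\[ \le \beta_A + M\delta_A. \]

The same argument applies to $\Delta_{B,x}$ and $\Delta_{C,x}$. Therefore,

\[ \mathbb{E}_x|\Delta_{A,x}| + \mathbb{E}_x|\Delta_{B,x}| + \mathbb{E}_x|\Delta_{C,x}| \le \beta_A + M\delta_A + \beta_B + M\delta_B + \beta_C + M\delta_C, \]

and substituting the bound for each term into the inequality above gives the result. □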
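Lemma 3.2 holds for any distribution over $x$, so in particular it holds for the empirical distribution of a finite sample, with the probabilities $\delta_A, \delta_B, \delta_C$ replaced by empirical fractions. The following sketch checks the inequality in that setting. All data, the noise model, the common threshold `beta`, and the helper names are assumptions made for illustration; none of them come from the text.

```python
import random

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def mean(values):
    return sum(values) / len(values)

def exceed_fraction(xs, xs_prime, beta):
    """Empirical probability that |x' - x| > beta."""
    return mean([1.0 if abs(a - b) > beta else 0.0 for a, b in zip(xs, xs_prime)])

def lemma_3_2_sides(A, B, C, Ap, Bp, Cp, M, beta):
    """Both sides of Lemma 3.2 under the empirical distribution,
    using the same threshold beta for A, B and C."""
    lhs = mean([a * (b - c) for a, b, c in zip(A, B, C)])
    rhs = mean([a * (b - c) for a, b, c in zip(Ap, Bp, Cp)])
    d_a = exceed_fraction(A, Ap, beta)
    d_b = exceed_fraction(B, Bp, beta)
    d_c = exceed_fraction(C, Cp, beta)
    rhs += 3 * M * beta + 3 * M ** 2 * d_a
    rhs += M * beta + M ** 2 * d_b
    rhs += M * beta + M ** 2 * d_c
    return lhs, rhs

rng = random.Random(0)
M, beta, N = 1.0, 0.05, 10000

def perturb(xs):
    """Illustrative perturbation: Gaussian noise, clipped back to [0, M]."""
    return [clip(x + rng.gauss(0.0, 0.03), 0.0, M) for x in xs]

A = [rng.uniform(0.0, M) for _ in range(N)]
B = [rng.uniform(0.0, M) for _ in range(N)]
C = [rng.uniform(0.0, M) for _ in range(N)]
lhs, rhs = lemma_3_2_sides(A, B, C, perturb(A), perturb(B), perturb(C), M, beta)
print(lhs, rhs, lhs <= rhs)
```

The check is not a proof; it only confirms that the constants 3, 1, 1 in the statement are consistent with the expansion used in the proof.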

We now prove that CVEEEloo stability implies generalization for symmetric algorithms.

Theorem 3.1 If the symmetric learning map is CVEEEloo stable then with probability $1 - \delta_{gen}$

\[ |I[f_S] - I_S[f_S]| \le \beta_{gen}, \]

where

\[ \delta_{gen} = \beta_{gen} = (2M\beta_{CV} + 2M^2\delta_{CV} + 3M\beta_{Er} + 3M^2\delta_{Er} + 5M\beta_{EE} + 5M^2\delta_{EE})^{1/4}. \]

Proof: 

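We first state the properties implied by the assumptions. In all cases the probability is over $S, z'$. Here $S, z'$ denotes the training set $S$ with the extra point $z'$ added, so that it has $n + 1$ points.

CV stability: with probability $1 - \delta_{CV}$

\[ |V(f_S, z') - V(f_{S,z'}, z')| \le \beta_{CV}. \]

Error stability: with probability $1 - \delta_{Er}$

\[ |\mathbb{E}_z V(f_S, z) - \mathbb{E}_z V(f_{S,z'}, z)| \le \beta_{Er}. \]

Empirical error stability: with probability $1 - \delta_{EE}$

\[ \left| \frac{1}{n} \sum_{z_j \in S} V(f_S, z_j) - \frac{1}{n+1} \sum_{z_j \in S,z'} V(f_{S,z'}, z_j) \right| \le \beta_{EE}. \]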

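Let us consider

\[ \mathbb{E}_S (I[f_S] - I_S[f_S])^2 = \mathbb{E}_S (I[f_S]^2 + I_S[f_S]^2 - 2 I[f_S] I_S[f_S]) \]
\[ = \mathbb{E}_S[I[f_S](I[f_S] - I_S[f_S])] + \mathbb{E}_S[I_S[f_S](I_S[f_S] - I[f_S])]. \]

We will only upper bound the two terms in the expansion above since a trivial lower bound on the above quantity is zero.

We first bound the first term

\[ \mathbb{E}_S \left[ \mathbb{E}_z V(f_S, z) \left( \mathbb{E}_{z'} V(f_S, z') - \frac{1}{n} \sum_{z_j \in S} V(f_S, z_j) \right) \right] \]
\[ = \mathbb{E}_{S,z'} \left[ \mathbb{E}_z V(f_S, z) \left( V(f_S, z') - \frac{1}{n} \sum_{z_j \in S} V(f_S, z_j) \right) \right]. \]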
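Given the stability assumptions and Lemma 3.2

\[ \mathbb{E}_{S,z'} \left[ \mathbb{E}_z V(f_S, z) \left( V(f_S, z') - \frac{1}{n} \sum_{z_j \in S} V(f_S, z_j) \right) \right] \]
\[ \le \mathbb{E}_{S,z'} \left[ \mathbb{E}_z V(f_{S,z'}, z) \left( V(f_{S,z'}, z') - \frac{1}{n+1} \sum_{z_j \in S,z'} V(f_{S,z'}, z_j) \right) \right] \]
\[ \quad + 3M\beta_{Er} + 3M^2\delta_{Er} + M\beta_{CV} + M^2\delta_{CV} + M\beta_{EE} + M^2\delta_{EE} \]
\[ = \mathbb{E}_{S,z,z'} \left[ V(f_{S,z'}, z) \left( V(f_{S,z'}, z') - \frac{1}{n+1} \sum_{z_j \in S,z'} V(f_{S,z'}, z_j) \right) \right] \]
\[ \quad + 3M\beta_{Er} + 3M^2\delta_{Er} + M\beta_{CV} + M^2\delta_{CV} + M\beta_{EE} + M^2\delta_{EE}. \]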

By symmetry

\[ \mathbb{E}_{S,z,z'} \left[ V(f_{S,z'}, z) \left( V(f_{S,z'}, z') - \frac{1}{n+1} \sum_{z_j \in S,z'} V(f_{S,z'}, z_j) \right) \right] = 0, \]

since the expectation of the two terms in the inner parentheses is identical: both are measuring the error on a training sample from datasets of $n + 1$ samples drawn i.i.d. This bounds the first term as follows

\[ \mathbb{E}_S[I[f_S](I[f_S] - I_S[f_S])] \le M(\beta_{CV} + M\delta_{CV} + 3\beta_{Er} + 3M\delta_{Er} + \beta_{EE} + M\delta_{EE}). \]


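Combining the above terms gives

\[ \mathbb{E}_S (I[f_S] - I_S[f_S])^2 \le B, \]

where

\[ B = 2M\beta_{CV} + 2M^2\delta_{CV} + 3M\beta_{Er} + 3M^2\delta_{Er} + 5M\beta_{EE} + 5M^2\delta_{EE}. \]

By the Bienaymé-Chebyshev inequality it follows that

\[ \mathbb{P}(|I[f_S] - I_S[f_S]| > \delta) \le \frac{B}{\delta^2} \]

for nonnegative $\delta$, implying that, with probability $1 - \delta$,

\[ |I[f_S] - I_S[f_S]| \le \sqrt{\frac{B}{\delta}}. \]

Setting $\delta = \sqrt{B}$ gives us the result, i.e. with probability $1 - \delta_{gen}$

\[ |I[f_S] - I_S[f_S]| \le \beta_{gen}, \]

where

\[ \delta_{gen} = \beta_{gen} = B^{1/4} \]

(for $B \le 1$ we have $\sqrt{B} \le B^{1/4}$, so the probability is at least $1 - B^{1/4}$). □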


Remarks: 


1. CVloo, Eloo and EEloo stability together are strong enough to imply generalization for general algorithms, but none of these conditions by itself is sufficient.


2. CVloo stability by itself is not sufficient for generalization, as the following counterexample shows. Let $X$ be uniform on $[0, 1]$. Let $Y \in \{-1, 1\}$. Let the "target function" be $f^*(x) = 1$, and the loss function be the (0,1)-loss.

Given a training set of size $n$, our (non-ERM) algorithm ignores the $y$ values and produces the following function:

\[ f_S(x) = \begin{cases} (-1)^n & \text{if } x \text{ is a training point} \\ (-1)^{n+1} & \text{otherwise.} \end{cases} \]

Now consider what happens when we remove a single training point to obtain $f_{S^i}$. Clearly,

\[ f_{S^i}(x) = \begin{cases} f_S(x) & \text{if } x = x_i \\ -f_S(x) & \text{otherwise.} \end{cases} \]

In other words, when we remove a training point, the value of the output function switches at every point except that training point. The value at the training point removed does not change at all, so the algorithm is $(\beta_C, \delta_C)$ CVloo stable with $\beta_C = \delta_C = 0$. However, this algorithm does not generalize at all; for every training set, depending on the size of the set, either the training error is 0 and the testing error is 1, or vice versa.

3. Eloo and EEloo stability by themselves are not sufficient for generalization, as the following counterexample shows. Using the same setup as in the previous remark, consider an algorithm which returns 1 at every training point, and -1 otherwise. This algorithm is EEloo and Eloo stable (and hypothesis stable), but is not CVloo stable and does not generalize.

4. In [4], Theorem 11, Bousquet and Elisseeff claim that PH stability (which is equivalent to our CVloo stability, by Lemma 3.1) is sufficient for generalization. However, there is an error in their proof. The second line of their theorem, translated into our notation, states correctly that

\[ \mathbb{E}_{S,z}[|V(f_S, z_i) - V(f_{S^{i,z}}, z_i)|] \le \mathbb{E}_S[|V(f_S, z_i) - V(f_{S^i}, z_i)|] + \mathbb{E}_{S,z}[|V(f_{S^i}, z_i) - V(f_{S^{i,z}}, z_i)|]. \]

Bousquet and Elisseeff use PH stability to bound both terms in the expansion. While the first term can be bounded using PH stability, the second term involves the difference in performance on $z_i$ between functions generated from two different training sets, neither of which contains $z_i$; this cannot be bounded using PH stability. The Bousquet and Elisseeff proof can be easily "fixed" by bounding the second term using the more general notion of (non-pointwise) hypothesis stability; this would then prove that the combination of CVloo stability and hypothesis stability is sufficient for generalization, which also follows directly from Proposition 3.1. Hypothesis stability is a strictly stronger notion than error stability and implies it. Elooerr stability (see later) does not imply hypothesis stability but is implied by it (see the footnote below on the proliferation of stability definitions).

5. Notice that hypothesis stability and CVloo stability imply generalization. Since hypothesis stability implies Eloo stability it follows that CVloo stability together with hypothesis stability implies EEloo stability (and generalization).
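The counterexample of Remark 2 above is simple enough to simulate. The sketch below implements the sign-flipping predictor $f_S(x) = (-1)^n$ on training points and $(-1)^{n+1}$ elsewhere, with target $f^*(x) = 1$ and the (0,1)-loss, and reports the CVloo gap together with the training and test errors. The helper names, the sample sizes, and the seeds are illustrative assumptions; the test error is estimated on fresh uniform points, which differ from the training points with probability one.

```python
import random

def make_f(train_xs):
    """Predictor of Remark 2: (-1)^n on training points, (-1)^(n+1) elsewhere."""
    n = len(train_xs)
    points = set(train_xs)
    def f(x):
        return (-1) ** n if x in points else (-1) ** (n + 1)
    return f

def zero_one_loss(f, x, y=1):
    """(0,1)-loss against the target function f*(x) = 1."""
    return 0 if f(x) == y else 1

def cv_loo_gap(train_xs, i):
    """|V(f_S, z_i) - V(f_{S^i}, z_i)| at the removed point z_i."""
    f_s = make_f(train_xs)
    f_si = make_f(train_xs[:i] + train_xs[i + 1:])
    x_i = train_xs[i]
    return abs(zero_one_loss(f_s, x_i) - zero_one_loss(f_si, x_i))

def train_and_test_error(train_xs, n_test=1000, seed=0):
    """Training error and a Monte Carlo estimate of the test error."""
    rng = random.Random(seed)
    f = make_f(train_xs)
    train_err = sum(zero_one_loss(f, x) for x in train_xs) / len(train_xs)
    test_xs = [rng.random() for _ in range(n_test)]
    test_err = sum(zero_one_loss(f, x) for x in test_xs) / n_test
    return train_err, test_err

rng = random.Random(42)
for n in (4, 5):
    xs = [rng.random() for _ in range(n)]
    gaps = [cv_loo_gap(xs, i) for i in range(n)]
    print(n, max(gaps), train_and_test_error(xs))
```

The CVloo gap is zero for every removed point, while the training and test errors sit at opposite ends of $[0, 1]$ and swap as $n$ changes parity, which is the failure of generalization described in the remark.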

3.4  Alternative  stability  conditions 

Our main result is in terms of CVEEEloo stability. There are however alternative conditions that together with CVloo stability are also sufficient for generalization and necessary and sufficient for consistency of ERM. One such condition is Expected-to-Leave-One-Out Error, in short Elooerr condition.


Definition 3.6 The learning map $L$ is Elooerr stable in a distribution-independent way, if for each $n$ there exists a $\beta_{EL}^{(n)}$ and a $\delta_{EL}^{(n)}$ such that

\[ \forall i \in \{1, \ldots, n\} \quad \mathbb{P}_S \left\{ \left| I[f_S] - \frac{1}{n} \sum_{i=1}^{n} V(f_{S^i}, z_i) \right| \le \beta_{EL}^{(n)} \right\} \ge 1 - \delta_{EL}^{(n)}, \]

with $\beta_{EL}^{(n)}$ and $\delta_{EL}^{(n)}$ going to zero for $n \to \infty$.

Footnote: There is an unfortunate and confusing proliferation of definitions of stability. The hypothesis stability of Bousquet and Elisseeff is essentially equivalent to the L1 stability of Kutin and Niyogi (modulo probabilistic versus non-probabilistic and change-one versus leave-one-out differences); similarly, what Kutin and Niyogi call $(\beta, \delta)$-hypothesis stability is a probabilistic version of the (very strong) uniform stability of Bousquet and Elisseeff. It is problematic that many versions of stability exist in both change-one and leave-one-out forms. If a given form of stability measures error at a point that is not in either training set, the change-one form implies the leave-one-out form (for example, Bousquet and Elisseeff's hypothesis stability implies Kutin and Niyogi's weak-L1 stability), but if the point at which we measure is added to the training set, this does not hold (for example, our CVloo stability does not imply the change-one CV stability of Kutin and Niyogi; in fact, Kutin and Niyogi's CV stability is roughly equivalent to the combination of our CVloo stability and Bousquet and Elisseeff's hypothesis stability).

Thinking of the Elooerr property as a form of stability may seem somewhat of a stretch (though the definition depends on a "perturbation" of the training set from $S$ to $S^i$). It may be justified however by the fact that the Elooerr property is implied - in the general setting - by a classical leave-one-out notion of stability called hypothesis stability (defined in a footnote below), which was introduced by Devroye and Wagner [7] and later used by [12, 4] (and in a stronger change-one form by [14]). Intuitively, the Elooerr condition seems both weak and strong. It looks weak because the leave-one-out error $\frac{1}{n} \sum_{i=1}^{n} V(f_{S^i}, z_i)$ seems a good empirical proxy for the expected error $\mathbb{E}_z V(f_S, z)$ and it is in fact routinely used in this way for evaluating empirically the expected error of learning algorithms.

Definition 3.8 When a learning map $L$ exhibits both CVloo and Elooerr stability, we will say that it has LOO stability.


3.4.1  LOO  stability  implies  generalization 


We now prove that CVloo and Elooerr stability together are sufficient for generalization for general learning algorithms. We will use the following lemma mentioned as Remark 10 in [4], of which we provide a simple proof.


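Lemma 3.3 Decomposition of the generalization error

\[ \mathbb{E}_S (I[f_S] - I_S[f_S])^2 \le 2\,\mathbb{E}_S \left( I[f_S] - \frac{1}{n} \sum_{j=1}^{n} V(f_{S^j}, z_j) \right)^2 + 2M\,\mathbb{E}_S |V(f_S, z_i) - V(f_{S^i}, z_i)|. \]

Proof:

By the triangle inequality and inspection

\[ \mathbb{E}_S (I[f_S] - I_S[f_S])^2 \le 2\,\mathbb{E}_S \left( I[f_S] - \frac{1}{n} \sum_{j=1}^{n} V(f_{S^j}, z_j) \right)^2 + 2\,\mathbb{E}_S \left( I_S[f_S] - \frac{1}{n} \sum_{j=1}^{n} V(f_{S^j}, z_j) \right)^2. \]

We now bound the second term

\[ \mathbb{E}_S \left( I_S[f_S] - \frac{1}{n} \sum_{j=1}^{n} V(f_{S^j}, z_j) \right)^2 = \mathbb{E}_S \left( \frac{1}{n} \sum_{j=1}^{n} V(f_S, z_j) - \frac{1}{n} \sum_{j=1}^{n} V(f_{S^j}, z_j) \right)^2 \]
\[ = \mathbb{E}_S \left| \frac{1}{n} \sum_{j=1}^{n} [V(f_S, z_j) - V(f_{S^j}, z_j)] \right| \cdot \left| \frac{1}{n} \sum_{j=1}^{n} [V(f_S, z_j) - V(f_{S^j}, z_j)] \right| \]
\[ \le M\,\mathbb{E}_S \left| \frac{1}{n} \sum_{j=1}^{n} [V(f_S, z_j) - V(f_{S^j}, z_j)] \right| \]
\[ \le M\,\mathbb{E}_S \frac{1}{n} \sum_{j=1}^{n} |V(f_S, z_j) - V(f_{S^j}, z_j)| \]
\[ = M\,\mathbb{E}_S |V(f_S, z_i) - V(f_{S^i}, z_i)|. \]

Footnote: Our definition of hypothesis stability - which is equivalent to leave-one-out stability in the $L_1$ norm - is:

Definition 3.7 The learning map $L$ has distribution-independent, leave-one-out hypothesis stability if for each $n$ there exists a $\beta_H^{(n)}$ such that

\[ \forall \mu \quad \mathbb{E}_S \mathbb{E}_z[|V(f_S, z) - V(f_{S^i}, z)|] \le \beta_H^{(n)}, \]

with $\beta_H^{(n)}$ going to zero for $n \to \infty$.

Footnote: Bousquet and Elisseeff attribute the result to Devroye and Wagner.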


Using the decomposition of the generalization error $I[f_S] - I_S[f_S]$ provided by the lemma it is clear that


Proposition  3.1  LOO  stability  implies  generalization. 


Remarks: 

1. Elooerr stability by itself is not sufficient for generalization, as a previous example showed (consider an algorithm which returns 1 for every training point, and -1 for every test point; this algorithm is Elooerr stable, as well as hypothesis stable, but does not generalize).

2. The converse of Theorem 3.1 is false. Considering the same basic setup as the example in the previous remark, consider an algorithm that, given a training set of size $n$, yields the constant function $f(x) = (-1)^n$. This algorithm possesses none of CVloo or Elooerr (or Eloo or EEloo) stability, but it will generalize.

3. CVEEEloo stability implies convergence of $\frac{1}{n} \sum_{i=1}^{n} V(f_{S^i}, z_i)$ to $I$, because we can use the decomposition lemma "in reverse", that is $(I - \frac{1}{n} \sum_{i=1}^{n} V(f_{S^i}, z_i))^2 \le 2(I - I_S)^2 + 2(I_S - \frac{1}{n} \sum_{i=1}^{n} V(f_{S^i}, z_i))^2$, and then use CVloo stability to bound the second term.

We now turn (see section 3.5) to the question of whether CVEEEloo stability (or LOO stability) is general enough to capture the fundamental conditions for consistency of ERM and thus subsume the "classical" theory. We will in fact show in the next subsection (3.5) that CVloo stability alone is equivalent to consistency of ERM. To complete the argument, we will also show in subsection 3.5 that Eloo and EEloo stability (as well as Elooerr) are implied by consistency of ERM. Thus CVEEEloo stability (as well as LOO stability) is implied by consistency of ERM.

3.5  CVEEEloo stability is necessary and sufficient for consistency of ERM


We begin by showing that CVloo stability is necessary and sufficient for consistency of ERM.


3.5.1  Almost  positivity  of  ERM 


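We first prove a lemma about the almost positivity of $V(f_{S^i}, z_i) - V(f_S, z_i)$, where $|S| = n$, as usual.

Lemma 3.4 (Almost-Positivity) Under the assumption that ERM finds a $\varepsilon^E$-minimizer,

\[ \forall i \in \{1, \ldots, n\} \quad V(f_{S^i}, z_i) - V(f_S, z_i) + 2(n-1)\varepsilon^E_{n-1} \ge 0. \]

Proof: By the definition of almost minimizer (see Equation (1)), we have

\[ \frac{1}{n} \sum_{z_j \in S} V(f_{S^i}, z_j) - \frac{1}{n} \sum_{z_j \in S} V(f_S, z_j) \ge -\varepsilon^E_n \qquad (15) \]

\[ \frac{1}{n-1} \sum_{z_j \in S^i} V(f_{S^i}, z_j) - \frac{1}{n-1} \sum_{z_j \in S^i} V(f_S, z_j) \le \varepsilon^E_{n-1}. \qquad (16) \]

We can rewrite the first inequality as

\[ \left[ \frac{1}{n} \sum_{z_j \in S^i} V(f_{S^i}, z_j) - \frac{1}{n} \sum_{z_j \in S^i} V(f_S, z_j) \right] + \frac{1}{n} V(f_{S^i}, z_i) - \frac{1}{n} V(f_S, z_i) \ge -\varepsilon^E_n. \]

The term in the bracket is less than or equal to $\frac{n-1}{n} \varepsilon^E_{n-1}$ (because of the second inequality) and thus

\[ V(f_{S^i}, z_i) - V(f_S, z_i) \ge -n\varepsilon^E_n - (n-1)\varepsilon^E_{n-1}. \]

Because the sequence of $n\varepsilon^E_n$ is a decreasing sequence of positive terms, we obtain

\[ V(f_{S^i}, z_i) - V(f_S, z_i) \ge -2(n-1)\varepsilon^E_{n-1}. \] □

Footnote: Shahar Mendelson's comments prompted us to define the notion of almost positivity.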
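In the special case of exact ERM ($\varepsilon^E_n = 0$), Lemma 3.4 says that the loss at the removed point can only go up when that point is left out of training. This can be checked on a toy problem. The sketch below assumes a finite hypothesis class of integer constants, integer outputs, and the square loss, so that all comparisons are exact; the class, the data, and the function names are illustrative assumptions and not part of the text. Ties in the minimization are broken by list order, which the argument of the lemma does not depend on.

```python
import random

def total_loss(c, sample):
    """Sum of square losses of the constant hypothesis c on the sample.

    The sum has the same minimizers as the empirical risk (the mean).
    """
    return sum((c - y) ** 2 for y in sample)

def erm(hypotheses, sample):
    """Exact ERM over a finite class (ties broken by list order)."""
    return min(hypotheses, key=lambda c: total_loss(c, sample))

def positivity_gaps(hypotheses, sample):
    """V(f_{S^i}, z_i) - V(f_S, z_i) for every left-out index i."""
    f_s = erm(hypotheses, sample)
    gaps = []
    for i, y_i in enumerate(sample):
        f_si = erm(hypotheses, sample[:i] + sample[i + 1:])
        gaps.append((f_si - y_i) ** 2 - (f_s - y_i) ** 2)
    return gaps

random.seed(1)
hypotheses = list(range(0, 11))
sample = [random.randint(0, 10) for _ in range(25)]
gaps = positivity_gaps(hypotheses, sample)
print(min(gaps) >= 0, gaps)
```

Integer arithmetic is used so that the minimizer is computed without rounding error; with floating-point risks a small tolerance would be needed in the comparison.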

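The following lemma will be key in the proof of our main theorem.

Lemma 3.5 For any $i \in \{1, 2, \ldots, n\}$, under almost ERM with $\varepsilon^E_n > 0$ chosen such that $\lim_{n \to \infty} n\varepsilon^E_n = 0$, the following (distribution free) bound holds

\[ \mathbb{E}_S[|V(f_{S^i}, z_i) - V(f_S, z_i)|] \le \mathbb{E}_S I[f_{S^i}] - \mathbb{E}_S I_S[f_S] + 4(n-1)\varepsilon^E_{n-1}. \]

Proof: We note that

\[ \mathbb{E}_S[|V(f_{S^i}, z_i) - V(f_S, z_i)|] = \mathbb{E}_S[|V(f_{S^i}, z_i) - V(f_S, z_i) + 2(n-1)\varepsilon^E_{n-1} - 2(n-1)\varepsilon^E_{n-1}|] \]
\[ \le \mathbb{E}_S[|V(f_{S^i}, z_i) - V(f_S, z_i) + 2(n-1)\varepsilon^E_{n-1}|] + 2(n-1)\varepsilon^E_{n-1}. \]

Now we make two observations. First, under the assumption of almost ERM, by Lemma 3.4,

\[ \forall i \in \{1, \ldots, n\} \quad V(f_{S^i}, z_i) - V(f_S, z_i) + 2(n-1)\varepsilon^E_{n-1} \ge 0, \qquad (17) \]

and therefore

\[ \mathbb{E}_S[|V(f_{S^i}, z_i) - V(f_S, z_i) + 2(n-1)\varepsilon^E_{n-1}|] = \mathbb{E}_S[V(f_{S^i}, z_i) - V(f_S, z_i)] + 2(n-1)\varepsilon^E_{n-1}. \]

Second, by the linearity of expectations,

\[ \mathbb{E}_S[V(f_{S^i}, z_i) - V(f_S, z_i)] = \mathbb{E}_S I[f_{S^i}] - \mathbb{E}_S I_S[f_S], \qquad (18) \]

and therefore

\[ \mathbb{E}_S[|V(f_{S^i}, z_i) - V(f_S, z_i)|] \le \mathbb{E}_S I[f_{S^i}] - \mathbb{E}_S I_S[f_S] + 4(n-1)\varepsilon^E_{n-1}. \] □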


Remarks: 

1. From exact positivity it follows that the leave-one-out error is greater than or equal to the training error

\[ \frac{1}{n} \sum_{i=1}^{n} V(f_{S^i}, z_i) \ge I_S[f_S]. \]

2. CVloo stability implies that the leave-one-out error converges to the training error in probability.

3.5.2  CVloo stability is necessary and sufficient for consistency of ERM


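In the next two theorems we prove first sufficiency and then necessity.

Theorem 3.2 If the map induced by ERM over a class $\mathcal{H}$ is distribution independent PH stable, and $\mathcal{L}$ is bounded, then ERM over $\mathcal{H}$ is universally consistent.

Proof: Given a sample $S = (z_1, \ldots, z_n)$ with $n$ points and a sample $S_{n+1} = (z_1, \ldots, z_{n+1})$ with an additional point then by distribution independent PH stability of ERM, the following holds for all $\mu$:

\[ \mathbb{E}_{S_{n+1}}[V(f_S, z_{n+1}) - V(f_{S_{n+1}}, z_{n+1})] \le \mathbb{E}_{S_{n+1}}[|V(f_S, z_{n+1}) - V(f_{S_{n+1}}, z_{n+1})|] \le (\beta_{PH})_{n+1}, \qquad (19) \]

where $(\beta_{PH})_{n+1}$ is associated with $S_{n+1}$ and $|S_{n+1}| = n + 1$.

The following holds for all $\mu$:

\[ \mathbb{E}_S I[f_S] - \mathbb{E}_{S_{n+1}} I_{S_{n+1}}[f_{S_{n+1}}] = \mathbb{E}_{S_{n+1}}[V(f_S, z_{n+1}) - V(f_{S_{n+1}}, z_{n+1})]. \qquad (20) \]

From Equations (19) and (20), we therefore have

\[ \forall \mu \quad \mathbb{E}_S I[f_S] \le \mathbb{E}_{S_{n+1}} I_{S_{n+1}}[f_{S_{n+1}}] + (\beta_{PH})_{n+1}. \qquad (21) \]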

Now  we  will  show  that 


lim  sup  (IEs/[/s]  -  inf  /[/])  =  0. 

n^oo  ^  f£H 

Let  rjfi  =  inf  /[/]  under  the  distribution  p.  Clearly,  for  all  /  G  H,  we  have 
I[f]  >  Vii  and  so  IEs/[/s]  >  rj^.  Therefore,  we  have  (from  (21)) 

V/i  rj^  <  IEs/[/s]  <  [/s„+i]  +  (/lpp)n+i-  (22) 

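For every $\varepsilon_c > 0$, there exists $f_{\varepsilon_c,\mu} \in H$ such that $I[f_{\varepsilon_c,\mu}] < \eta_\mu + \varepsilon_c$. By the almost ERM property, we also have

$$I_{S_{n+1}}[f_{S_{n+1}}] \le I_{S_{n+1}}[f_{\varepsilon_c,\mu}] + \varepsilon^E_{n+1}.$$

Taking expectations with respect to $S_{n+1}$ and substituting in eq. (22), we get

$$\forall\mu \quad \eta_\mu \le \mathbb{E}_S I[f_S] \le \mathbb{E}_{S_{n+1}} I_{S_{n+1}}[f_{\varepsilon_c,\mu}] + \varepsilon^E_{n+1} + (\beta_{PH})_{n+1}.$$

Now we make the following observations. First, $\lim_{n\to\infty} \varepsilon^E_{n+1} = 0$. Second, $\lim_{n\to\infty} (\beta_{PH})_n = 0$. Finally, by considering the fixed function $f_{\varepsilon_c,\mu}$, we get

$$\forall\mu \quad \mathbb{E}_{S_{n+1}} I_{S_{n+1}}[f_{\varepsilon_c,\mu}] = \frac{1}{n+1}\sum_{i=1}^{n+1} \mathbb{E}_{S_{n+1}} V(f_{\varepsilon_c,\mu}, z_i) = I[f_{\varepsilon_c,\mu}] \le \eta_\mu + \varepsilon_c.$$

Therefore, for every fixed $\varepsilon_c > 0$, for $n$ sufficiently large,

$$\forall\mu \quad \eta_\mu \le \mathbb{E}_S I[f_S] \le \eta_\mu + \varepsilon_c,$$

from which we conclude, for every fixed $\varepsilon_c > 0$,

$$0 \le \liminf_{n\to\infty} \sup_\mu \big(\mathbb{E}_S I[f_S] - \eta_\mu\big) \le \limsup_{n\to\infty} \sup_\mu \big(\mathbb{E}_S I[f_S] - \eta_\mu\big) \le \varepsilon_c.$$

From this it follows that $\lim_{n\to\infty} \sup_\mu (\mathbb{E}_S I[f_S] - \eta_\mu) = 0$. Consider the random variable $X_S = I[f_S] - \eta_\mu$. Clearly, $X_S \ge 0$. Also, $\lim_{n\to\infty} \sup_\mu \mathbb{E}_S X_S = 0$. Therefore, we have (from Markov's inequality applied to $X_S$): for every $\alpha > 0$,

$$\lim_{n\to\infty} \sup_\mu \mathbb{P}\big[I[f_S] > \eta_\mu + \alpha\big] = \lim_{n\to\infty} \sup_\mu \mathbb{P}[X_S > \alpha] \le \lim_{n\to\infty} \sup_\mu \frac{\mathbb{E}_S[X_S]}{\alpha} = 0.$$

This proves distribution independent convergence of $I[f_S]$ to $\eta_\mu$ (consistency), given PH stability. □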
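The last step of the proof uses only Markov's inequality for a nonnegative random variable. The sketch below checks that step exactly, with rational arithmetic, on a hypothetical two-valued distribution for the excess risk $X_S$. The distribution is an assumption chosen here for illustration (its mean tends to zero as $n$ grows) and does not come from the paper.

```python
from fractions import Fraction


def toy_excess_risk(n):
    """Hypothetical law of X_S: value 1 with probability 1/n, value 1/n otherwise."""
    values = [Fraction(1), Fraction(1, n)]
    probs = [Fraction(1, n), 1 - Fraction(1, n)]
    return values, probs


def tail_and_markov_bound(values, probs, alpha):
    """Return (P[X >= alpha], E[X] / alpha) for a nonnegative discrete X."""
    tail = sum(p for x, p in zip(values, probs) if x >= alpha)
    mean = sum(x * p for x, p in zip(values, probs))
    return tail, mean / alpha


alpha = Fraction(1, 10)
for n in (100, 1000, 10000):
    tail, bound = tail_and_markov_bound(*toy_excess_risk(n), alpha)
    # The tail probability never exceeds the Markov bound,
    # and the bound shrinks as the mean of X_S shrinks.
    print(n, float(tail), float(bound))
```

Because the bound $\mathbb{E}_S[X_S]/\alpha$ tends to zero for every fixed $\alpha$, so does the tail probability, which is the convergence in probability used above.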

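Theorem 3.3 Consistency of ERM (over H) implies PH stability of ERM (over H).

PROOF: To show PH stability, we need to show that

$$\lim_{n\to\infty} \sup_\mu \mathbb{E}_S[\,|V(f_{S^i},z_i) - V(f_S,z_i)|\,] = 0.$$

From Lemma 3.5,

$$\forall\mu \quad \mathbb{E}_S[\,|V(f_{S^i},z_i) - V(f_S,z_i)|\,] \le \mathbb{E}_S I[f_{S^i}] - \mathbb{E}_S I_S[f_S] + 4(n-1)\varepsilon^E_{n-1}. \qquad (23)$$

Given (universal) consistency, Theorem 2.1 implies that $\mathcal{L}$ is a uGC class. Because $\mathcal{L}$ is uGC, $I[f_{S^i}]$ is close to $I_S[f_{S^i}]$. Because we are performing ERM, $I_S[f_{S^i}]$ is close to $I_S[f_S]$. Combining these results, $I[f_{S^i}] - I_S[f_S]$ is small.

We start with the equality

$$\mathbb{E}_S\big[I[f_{S^i}] - I_S[f_S]\big] = \mathbb{E}_S\big[I[f_{S^i}] - I_S[f_{S^i}]\big] + \mathbb{E}_S\big[I_S[f_{S^i}] - I_S[f_S]\big]. \qquad (24)$$

Since $\mathcal{L}$ is uGC, we have ($\forall\mu$) with probability at least $1 - \delta_n(\varepsilon_n)$,

$$|I[f_{S^i}] - I_S[f_{S^i}]| \le \varepsilon_n \qquad (25)$$

and therefore

$$\forall\mu \quad \mathbb{E}_S\big[I[f_{S^i}] - I_S[f_{S^i}]\big] \le \mathbb{E}_S\big[\,|I[f_{S^i}] - I_S[f_{S^i}]|\,\big] \le \varepsilon_n + M\delta_n(\varepsilon_n). \qquad (26)$$

From Lemma 3.6, we have

$$\forall\mu \quad \mathbb{E}_S\big[I_S[f_{S^i}] - I_S[f_S]\big] \le \frac{M}{n} + \varepsilon^E_{n-1}. \qquad (27)$$

Combining Equation (24) with inequalities (26) and (27), we get

$$\forall\mu \quad \mathbb{E}_S\big[I[f_{S^i}] - I_S[f_S]\big] \le \varepsilon_n + M\delta_n(\varepsilon_n) + \frac{M}{n} + \varepsilon^E_{n-1}.$$

From inequality (23), we obtain

$$\forall\mu \quad \mathbb{E}_S[\,|V(f_{S^i},z_i) - V(f_S,z_i)|\,] \le \varepsilon_n + M\delta_n(\varepsilon_n) + \frac{M}{n} + \varepsilon^E_{n-1} + 4(n-1)\varepsilon^E_{n-1}.$$

Note that $\varepsilon^E_n$ and $\varepsilon_n$ may be chosen independently. Also, since we are guaranteed arbitrarily good $\varepsilon$-minimizers, we can choose $\varepsilon^E_n$ to be a decreasing sequence such that $\lim_{n\to\infty} (4n-3)\varepsilon^E_{n-1} = 0$.

Further, by Lemma 3.7, it is possible to choose a sequence $\varepsilon_n$ such that $\varepsilon_n \to 0$ and $\delta_n(\varepsilon_n) \to 0$. These observations taken together prove that

$$\lim_{n\to\infty} \sup_\mu \mathbb{E}_S[\,|V(f_{S^i},z_i) - V(f_S,z_i)|\,] = 0.$$

This proves that universal consistency implies PH stability. □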


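Lemma 3.6 Under almost ERM,

$$I_S[f_{S^i}] - I_S[f_S] \le \frac{M}{n} + \varepsilon^E_{n-1},$$

that is, ERM has $EE_{loo}$ stability.

Proof:

$$I_S[f_{S^i}] = \frac{(n-1)\, I_{S^i}[f_{S^i}] + V(f_{S^i},z_i)}{n}$$

$$\le \frac{(n-1)\big(I_{S^i}[f_S] + \varepsilon^E_{n-1}\big) + V(f_{S^i},z_i)}{n} \quad \text{(by almost ERM)}$$

$$\le \frac{(n-1)\, I_{S^i}[f_S] + V(f_S,z_i) - V(f_S,z_i) + V(f_{S^i},z_i)}{n} + \varepsilon^E_{n-1}$$

$$\le I_S[f_S] + \frac{M}{n} + \varepsilon^E_{n-1} \quad \text{since } 0 \le V \le M. \qquad \square$$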
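The bound of Lemma 3.6 can be checked numerically in the case of exact ERM ($\varepsilon^E_{n-1} = 0$). The sketch below is an illustration under assumptions made here, not part of the paper: the hypothesis class is a finite grid of threshold classifiers on $[0,1]$, $V$ is the 0-1 loss (so $M = 1$), and the labels are noisy. Working with error counts rather than averages, the lemma reads: on $S$, the classifier $f_{S^i}$ makes at most one more error than $f_S$. All names are hypothetical.

```python
import random


def zero_one_loss(threshold, z):
    """V(f, z) for the classifier x -> 1[x >= threshold] on the point z = (x, y)."""
    x, y = z
    prediction = 1 if x >= threshold else 0
    return int(prediction != y)


def error_count(threshold, sample):
    """n * I_S[f]: number of errors of the threshold classifier on the sample."""
    return sum(zero_one_loss(threshold, z) for z in sample)


def erm_threshold(sample, grid):
    """Exact ERM over a finite grid of thresholds (ties go to the first minimizer)."""
    return min(grid, key=lambda t: error_count(t, sample))


rng = random.Random(1)  # fixed seed so the run is reproducible
grid = [k / 20 for k in range(21)]
n = 40
sample = []
for _ in range(n):
    x = rng.random()
    y = int(x >= 0.5)
    if rng.random() < 0.2:  # flip the label with probability 0.2
        y = 1 - y
    sample.append((x, y))

f_s = erm_threshold(sample, grid)
base = error_count(f_s, sample)

# n * (I_S[f_{S^i}] - I_S[f_S]) for every left-out index i
gaps = []
for i in range(n):
    f_si = erm_threshold(sample[:i] + sample[i + 1:], grid)
    gaps.append(error_count(f_si, sample) - base)

print("smallest gap:", min(gaps), "largest gap:", max(gaps))
```

The gaps are nonnegative because $f_S$ minimizes the empirical error on $S$, and they never exceed $M = 1$ because of the argument in the proof; neither fact depends on the random draw.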


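Lemma 3.7 If $\mathcal{L}$ is uGC, there exists a sequence $\varepsilon_n > 0$ such that:

(1) $\lim_{n\to\infty} \varepsilon_n = 0$

(2) $\lim_{n\to\infty} \delta_n(\varepsilon_n) = 0$.

Proof: Because $\mathcal{L}$ is uGC,

$$\sup_\mu \mathbb{P}\Big[\sup_{f \in H} |I[f] - I_S[f]| > \varepsilon\Big] \le \delta_n(\varepsilon)$$

where $\lim_{n\to\infty} \delta_n(\varepsilon) = 0$.

Since this holds for every fixed $\varepsilon$, we know that $\lim_{n\to\infty} \delta_n(1/k) = 0$ for every fixed integer $k$.

Let $N_k$ be such that for all $n \ge N_k$, we have $\delta_n(1/k) < 1/k$. Note that the $N_k$ can be chosen so that $N_i \ge N_j$ for all $i > j$.

Now choose the following sequence for $\varepsilon_n$. We take $\varepsilon_n = 1$ for all $n < N_2$; $\varepsilon_n = \frac{1}{2}$ for $N_2 \le n < N_3$, and in general $\varepsilon_n = \frac{1}{k}$ for all $N_k \le n < N_{k+1}$.

Clearly $\varepsilon_n$ is a decreasing sequence converging to 0. Further, for all $N_k \le n < N_{k+1}$, we have

$$\delta_n(\varepsilon_n) = \delta_n(1/k) \le \frac{1}{k}.$$

Clearly $\delta_n(\varepsilon_n)$ also converges to 0. □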
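The construction in the proof of Lemma 3.7 can be sketched in code. The confidence function used below, $\delta_n(\varepsilon) = \min(1, 2e^{-2n\varepsilon^2})$, is an assumption made here for illustration (a Hoeffding-type bound for a single function with loss in $[0,1]$); the paper only assumes that $\delta_n(\varepsilon) \to 0$ for each fixed $\varepsilon$. Because this particular $\delta_n(\varepsilon)$ is nonincreasing in $n$, $N_k$ can be taken as the first $n$ at which $\delta_n(1/k) < 1/k$. All names are hypothetical.

```python
import math


def delta(n, eps):
    """Illustrative confidence function (an assumption, not the paper's delta)."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps * eps))


def first_index(k):
    """Smallest N_k with delta(n, 1/k) < 1/k for all n >= N_k.

    Valid because delta(n, eps) is nonincreasing in n for this choice of delta.
    """
    n = 1
    while delta(n, 1.0 / k) >= 1.0 / k:
        n += 1
    return n


def epsilon_sequence(n_max):
    """Return [eps_1, ..., eps_{n_max}] with eps_n = 1/k for N_k <= n < N_{k+1}."""
    eps = []
    k = 1
    next_threshold = first_index(k + 1)
    for n in range(1, n_max + 1):
        while n >= next_threshold:
            k += 1
            next_threshold = first_index(k + 1)
        eps.append(1.0 / k)
    return eps


eps = epsilon_sequence(200)
print("N_2, N_3, N_4:", first_index(2), first_index(3), first_index(4))
print("eps_1..eps_10:", eps[:10])
```

By construction $\varepsilon_n$ is nonincreasing and $\delta_n(\varepsilon_n) < \varepsilon_n$ for every $n$, so both sequences tend to zero together, which is the content of the lemma.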
Remarks: 

1. Convergence of the empirical error to the expected error follows from either $CVEEE_{loo}$ or LOO stability without assuming ERM (see Theorem 3.1 and Proposition 3.1).

2. In general the bounds above are not exponential in $\delta$. However, since for ERM $CV_{loo}$ stability implies that $\mathcal{L}$ is uGC, the standard uniform bound holds, which for any given $\varepsilon$ is exponential in $\delta$


sup  P  jsup  |/[/]  -  /s[/]|  >  sj  <  CAf  e  . 

Notice that the covering number can grow arbitrarily fast in $1/\varepsilon$, resulting in an arbitrarily slow rate of convergence between $I_S[f]$ and $I[f]$.

Pseudostability:  a  remark 

It  is  possible  to  define  a  one-sided  version  of  PH  stability,  called  here  pseudoPH 
stability: 

Definition 3.9 The learning map L has distribution-independent, pseudo pointwise hypothesis stability if for each n there exists a $\beta^{(n)}_{pPH}$ such that

$$\forall i \in \{1,\dots,n\} \quad \forall\mu \quad \mathbb{E}_S[V(f_{S^i},z_i) - V(f_S,z_i)] \le \beta^{(n)}_{pPH},$$

with $\beta^{(n)}_{pPH}$ going to zero for $n \to \infty$.

PseudoPH stability is also necessary and sufficient for universal consistency of ERM. PseudoPH stability is weaker than PH stability. The proof of its equivalence to consistency of ERM is immediate, following directly from its definition. However, for general (non-ERM) algorithms pseudoPH stability is not sufficient in our approach to ensure convergence in probability of the empirical to the expected risk (i.e. generalization), when combined with $E_{loo}$ and $EE_{loo}$ (or $Eloo_{err}$) stability.^{23}

Theorems  3.2  and  3.3  can  be  stated  together  with  the  remark  on  pseudoPH 
stability  to  yield 

Theorem  3.4  Either  PH  or  pseudoPH  stability  of  ERM  (over  H)  is  necessary  and 
sufficient  for  consistency  of  ERM  (over  H). 

If we make specific assumptions on the loss function V (see a previous footnote), then the above theorem can be stated in terms of H being uGC.

A  short  summary  of  the  argument 

The proof just given of necessity and sufficiency of $CV_{loo}$ stability for consistency of ERM has a simple structure, despite the technical details. We summarize it here in the special case of exact minimization of the empirical risk and existence of the minima of the true risk (which we do not assume in the full proof in the previous section) to expose the essence of the proof^{24}. In this short summary, we only show that $CV_{loo}$ stability (as well as pseudoPH stability) is necessary and sufficient for consistency of ERM; because ERM on a uGC class is always $E_{loo}$, $EE_{loo}$ (and $Eloo_{err}$) stable, the necessity and sufficiency of $CVEEE_{loo}$ (and LOO) stability follows directly.

Theorem 3.5 Under exact minimization of the empirical risk and existence of the minima of the true risk, distribution independent $(\beta, \delta)$ $CV_{loo}$ stability is equivalent to the convergence $I[f_S] \to I[f^*]$ in probability, where $f^* \in \arg\min_{f \in H} I[f]$.

PROOF: By the assumption of exact ERM, positivity (instead of almost positivity, see Lemma 3.4) holds, that is

$$V(f_{S^i},z_i) - V(f_S,z_i) \ge 0.$$

Then the following equivalences hold:

$$(\beta,\delta)\ CV_{loo}\ \text{stability} \iff \lim_{n\to\infty} \mathbb{E}_S[\,|V(f_{S^i},z_i) - V(f_S,z_i)|\,] = 0$$

$$\iff \lim_{n\to\infty} \mathbb{E}_S[V(f_{S^i},z_i) - V(f_S,z_i)] = 0$$

$$\iff \lim_{n\to\infty} \big(\mathbb{E}_S I[f_{S^i}] - \mathbb{E}_S I_S[f_S]\big) = 0$$

$$\iff \lim_{n\to\infty} \mathbb{E}_S I[f_{S^i}] = \lim_{n\to\infty} \mathbb{E}_S I_S[f_S].$$

^{23} With pseudoPH stability alone, we are unable to bound the second term in the decomposition of Lemma 3.3.

^{24} This short version of the proof could be made shorter by referring to known results on ERM. The argument for almost ERM can be made along similar lines.

Now, $I[f^*] \le I[f_{S^i}]$ and $I_S[f_S] \le I_S[f^*]$. Therefore,

$$I[f^*] \le \lim_{n\to\infty} \mathbb{E}_S I[f_{S^i}] = \lim_{n\to\infty} \mathbb{E}_S I_S[f_S] \le \lim_{n\to\infty} \mathbb{E}_S I_S[f^*] = I[f^*],$$

resulting in

$$\lim_{n\to\infty} \mathbb{E}_S I[f_{S^i}] = \lim_{n\to\infty} \mathbb{E}_S I[f^*] = I[f^*],$$

which implies that in probability,

$$\lim_{n\to\infty} I[f_{S^i}] = I[f^*].$$

The last step is to show that

$$\lim_{n\to\infty} \mathbb{E}_S I[f_{S^i}] = I[f^*] \qquad (28)$$

is equivalent to the statement that $I[f_{S^i}] \to I[f^*]$ in probability.

Since $0 \le I[f] \le M$ for all $f$, convergence in probability implies equation (28). The other direction of the statement follows from the fact that

$$I[f_{S^i}] - I[f^*] \ge 0$$

because of the definition of $f^*$. Therefore,

$$\mathbb{E}_S\big[I[f_{S^i}] - I[f^*]\big] \ge 0, \qquad (29)$$

which, together with equation (28), implies that in probability

$$\lim_{n\to\infty} I[f_{S^i}] = I[f^*]. \qquad (30)$$

Finally we note that the convergence in probability of $I[f_{S^i}]$ to $I[f^*]$ is equivalent to consistency. If the draw $S_d$ of the training set has $n+1$ elements, the convergence of $I[f_{S_d^i}]$ to $I[f^*]$ in probability is equivalent to the convergence of $I[f_S]$ to $I[f^*]$ in probability. □


3.5.3 Consistency of ERM implies $E_{loo}$ and $EE_{loo}$ (and also $Eloo_{err}$) stability

ERM is $EE_{loo}$ stable (even when H is not a uGC class) as shown by Lemma 3.6. Consistency of ERM implies $E_{loo}$ stability as shown by the following Lemma:

Lemma 3.8 If almost ERM is consistent then

$$\lim_{n\to\infty} \mathbb{E}_S|I[f_{S^i}] - I[f_S]| = 0.$$

The lemma follows immediately from the following condition (because of Jensen inequality)

$$\mathbb{E}_S|I[f_S] - I[f_{S^i}]| \le \mathbb{E}_S|I[f_S] - I_S[f_S]| + \mathbb{E}_S|I_S[f_S] - I[f_{S^i}]|$$

$$\le \mathbb{E}_S|I[f_S] - I_S[f_S]| + \mathbb{E}_S|I_S[f_{S^i}] - I[f_{S^i}]| + \mathbb{E}_S|I_S[f_{S^i}] - I_S[f_S]|$$

which implies $E_{loo}$ stability (using consistency to bound the first two terms in the right hand side and $EE_{loo}$ stability to bound the last term). Thus consistency of ERM implies $CVEEE_{loo}$ stability.

We now show that consistency of ERM implies $Eloo_{err}$ stability.

Lemma 3.9 ERM on a uGC class implies

$$\mathbb{E}_S\Big(I[f_S] - \frac{1}{n}\sum_{i=1}^{n} V(f_{S^i},z_i)\Big)^2 \le \beta_n,$$

where $\lim_{n\to\infty} \beta_n = 0$.

Proof:

By the triangle inequality and inspection

$$\mathbb{E}_S\Big(I[f_S] - \frac{1}{n}\sum_{i=1}^{n} V(f_{S^i},z_i)\Big)^2 \le 2\,\mathbb{E}_S\big(I[f_S] - I_S[f_S]\big)^2 + 2\,\mathbb{E}_S\Big(I_S[f_S] - \frac{1}{n}\sum_{i=1}^{n} V(f_{S^i},z_i)\Big)^2.$$

We first bound the first term. Since we are performing ERM on a uGC class we have with probability $1 - \delta_1$

$$|I_S[f_S] - I[f_S]| \le \beta_1.$$

Therefore,

$$\mathbb{E}_S\big(I[f_S] - I_S[f_S]\big)^2 \le M\beta_1 + M^2\delta_1.$$

The following inequality holds for the second term (see proof of Lemma 3.3)

$$\mathbb{E}_S\Big(I_S[f_S] - \frac{1}{n}\sum_{i=1}^{n} V(f_{S^i},z_i)\Big)^2 \le M\,\mathbb{E}_S|V(f_S,z_i) - V(f_{S^i},z_i)|.$$

Since ERM is on a uGC class, $(\beta_2, \delta_2)$ $CV_{loo}$ stability holds, implying

$$M\,\mathbb{E}_S|V(f_S,z_i) - V(f_{S^i},z_i)| \le M\beta_2 + M^2\delta_2.$$

Therefore we obtain

$$\mathbb{E}_S\Big(I_S[f_S] - \frac{1}{n}\sum_{i=1}^{n} V(f_{S^i},z_i)\Big)^2 \le M\beta_2 + M^2\delta_2$$

leading to

$$\mathbb{E}_S\Big(I[f_S] - \frac{1}{n}\sum_{i=1}^{n} V(f_{S^i},z_i)\Big)^2 \le 2M\beta_1 + 2M^2\delta_1 + 2M\beta_2 + 2M^2\delta_2.$$

□


3.5.4  Main  result 

We  are  now  ready  to  state  the  main  result  of  section  3.5. 

Theorem 3.6 Assume that $f_S, f_{S^i} \in \mathcal{H}$ are provided by ERM and $\mathcal{L}$ is bounded. Then distribution independent $CVEEE_{loo}$ stability (as well as LOO stability) is necessary and sufficient for consistency of ERM. Therefore, the following are equivalent

a) the map induced by ERM is distribution independent $CVEEE_{loo}$ stable

a') the map induced by ERM is distribution independent LOO stable

b) almost ERM is universally consistent.

c) $\mathcal{L}$ is uGC

PROOF: The equivalence of (b) and (c) is well-known (see Theorem 2.1). We showed that $CV_{loo}$ stability is equivalent to PH stability and that PH stability implies (b). We have also shown that almost ERM exhibits $CVEEE_{loo}$ stability (and that almost ERM exhibits LOO stability). The theorem follows. □

Remark:

1. In the classical literature on generalization properties of local classification rules ([7]) hypothesis stability was proven (and used) to imply $Eloo_{err}$ stability. It is thus natural to ask whether we could replace $Eloo_{err}$ stability with hypothesis stability in theorem 3.6. Unfortunately, we have been unable to either prove that ERM on a uGC class has hypothesis stability or provide a counterexample. The question remains therefore open. It is known that ERM on a uGC class has hypothesis stability when either a) H is convex, or b) the setting is realizable^{25}, or c) H has a finite number of hypotheses.

3.5.5  Distribution-dependent  stability  and  consistency 

Our main result is given in terms of distribution-free stability and distribution-free consistency. In this distribution-free framework consistency of ERM is equivalent to $\mathcal{L}$ being uGC. Inspection of the proof suggests that it may be possible to reformulate our theorem (see also 3.5) in a distribution dependent way: for ERM, $CV_{loo}$ stability with respect to a specific distribution is necessary and sufficient for consistency with respect to the same distribution.^{26} Of course, in this distribution-dependent framework $\mathcal{L}$ may not be uGC.

^{25} We say that the setting is realizable when there is some $f_0 \in \mathcal{H}$ which is consistent with the examples.

^{26} It should be possible to reformulate the definitions of distribution-dependent consistency and $CV_{loo}$ stability appropriately, to avoid the case of trivial consistency (following [25]).


4  Stability  conditions,  convergence  rates  and  size  of 
uGC  classes 

The previous section concludes the main body of the paper. This section consists of a few "side" observations. It is possible to provide rates of convergence of the empirical risk to the expected risk as a function of $CV_{loo}$ stability using theorem 3.2. In general these rates will be very slow, also in the case of ERM. In this section we outline how $CV_{loo}$ stability can be used to control the expectation and error stability can be used to control the variance. The two notions of stability together will be called strong stability when the rate of convergence of error stability is fast enough. Strong stability yields faster rates of convergence of the empirical error to the expected error. In this section we define strong stability and list several "small" hypothesis spaces for which ERM is strongly stable.

The following definition of the continuity of the learning map L is based upon a variation of two definitions of stability first introduced in [14].

Definition 4.1 The learning map L is strongly stable if

a. it has $(\beta_{loo}, \delta_{loo})$ $CV_{loo}$ stability

b. it has error stability with a fast rate, i.e. for each n there exists a $\beta^{(n)}_{error}$ and a $\delta^{(n)}_{error}$ such that

$$\forall i \in \{1,\dots,n\} \quad \forall\mu \quad \mathbb{P}_S\{|I[f_S] - I[f_{S^i}]| \le \beta_{error}\} \ge 1 - \delta_{error},$$

where $\beta_{error} = O(n^{-\alpha})$ where $\alpha > 1/2$ and $\delta_{error} = e^{-\Omega(n)}$.

Our definition of strong stability depends on $CV_{loo}$ stability and on the difference in the expected values of the losses $(I[f_S] - I[f_{S^i}])$.
The following theorem is similar to theorem 6.17 in [14].

Theorem 4.1 If the learning map is strongly stable then, for any $\varepsilon > 0$,

$$\mathbb{P}_S\{|I_S[f_S] - I[f_S]| \ge \varepsilon + \beta_{loo} + M\delta_{loo} + \beta_{error} + M\delta_{error}\} \le 2\left(\exp\left(\frac{-\varepsilon^2 n}{8(2n\beta_{error} + M)^2}\right) + \frac{n(n+1)\,2M\delta_{error}}{2n\beta_{error} + M}\right)$$

where M is a bound on the loss.

The above bound states that with high probability the empirical risk converges to the expected risk at the rate of the slower of the two rates $\beta_{loo}$ and $\beta_{error}$. The probability of the lack of convergence decreases exponentially as n increases. The proof of the above theorem is in Appendix 6.2 and is based on a version of McDiarmid's inequality.

For the empirical risk to converge to the expected risk in the above bound $\beta_{error}$ must decrease strictly faster than $O(n^{-1/2})$. For ERM the rate of convergence of $\beta_{error}$ is the same rate as the convergence of the empirical error to the expected error.

Error stability with a fast rate of convergence is a strong requirement. In general, for a uGC class the rate of convergence of error stability can be arbitrarily slow because the covering number associated with the function class can grow arbitrarily fast with $\varepsilon^{-1}$.^{27} Even for hypothesis spaces with VC dimension d the rate of convergence of error stability is not fast enough, with probability

Fast rates for error stability can be achieved for ERM with certain hypothesis spaces and settings:

• ERM on VC classes of indicator functions in the realizable setting

• ERM with square loss function on balls in Sobolev spaces $H^s(X)$, with compact $X \subset \mathbb{R}^d$, if $s > d$ (this is due to Proposition 6 in [5]);

• ERM with square loss function on balls in RKHS spaces with a kernel K which is with $s > d$ (this can be inferred from [26]);

• ERM on VC-subgraph classes that are convex with the square loss.


A requirement for fast rates of error stability is that the class of functions H is "small": hypothesis spaces with empirical covering numbers $\mathcal{N}(\mathcal{H}, \varepsilon)$ that are polynomial in $\varepsilon^{-1}$ (VC classes fall into this category) or exponential in $\varepsilon^{-p}$ with $p < 1$ (the Sobolev spaces and RKHS spaces fall into this category). Simply having a "small" function class is not enough for fast rates: added requirements such as either the realizable setting or assumptions on the convexity of $\mathcal{H}$ and square loss are needed.

There are many situations where convergence of the empirical risk to the expected risk can have rates of the order of $O(\sqrt{d/n})$ using standard VC or covering number bounds, here d is the metric entropy or shattering dimension of the class $\mathcal{H}$. For these cases we do not have stability based bounds that allow us to prove rates of convergence of the empirical error to the expected error faster than the polynomial bound in theorem 3.1 which gives suboptimal rates that are much slower than $O(\sqrt{d/n})$. The following cases fall into the gap between general uGC classes that have slow rates of convergence^{29} and those classes that have a fast rate of convergence^{30}:

• ERM on convex hulls of VC classes.

• ERM on balls in Sobolev spaces $H^s(X)$ if $2s > d$, which is the condition that ensures that functions in the space are defined pointwise - a necessary requirement for learning. In this case the standard union bounds give rates of convergence $\Omega((1/n)^b)$: for the general case $b = 1/4$ and for the convex case $b = 1/3$.

• ERM on VC classes of indicator functions in the non-realizable setting.

^{27} Take a compact set K of continuous functions in the sup norm, so that $N(\varepsilon, K)$ is finite for all $\varepsilon > 0$. The set is uniform Glivenko-Cantelli. $N(\varepsilon, K)$ can go to infinity arbitrarily fast as $\varepsilon \to 0$ in the sup norm (Dudley, pers. com.).

^{28} This case was considered in [14] theorem 7.4.
Theorem: Let $\mathcal{H}$ be a space of ±1-classifiers. The following are equivalent

1. There is a constant K such that for any distribution $\Delta$ on Z and any $f_0 \in \mathcal{H}$, ERM over $\mathcal{H}$ is $(0, e^{-Kn})$ CV stable with respect to the distribution on Z generated by $\Delta$ and $f_0$.

2. The VC dimension of $\mathcal{H}$ is finite.


5  Discussion 

The results of this paper are interesting from two quite different points of view. From the point of view (A) of the foundations of learning theory, they provide a condition - $CVEEE_{loo}$ stability - that extends the classical conditions beyond ERM and subsumes them in the case of ERM. From the point of view (B) of inverse problems, our results show that the conditions of well-posedness of the algorithm (specifically stability), and the condition of predictivity (specifically generalization) that played a key but independent role in the development of learning theory and learning algorithms respectively, are in fact closely related: well-posedness (defined in terms of $CVEEE_{loo}$ stability) implies predictivity and it is equivalent to it for ERM algorithms.

• (A): Learning techniques start from the basic and old problem of fitting a multivariate function to measurement data. The characteristic feature central to the learning framework is that the fitting should be predictive, in the same way that cleverly fitting data from an experiment in physics can uncover the underlying physical law, which should then be usable in a predictive way. In this sense, the same generalization results of learning theory also characterize the conditions under which predictive and therefore scientific "theories" can be extracted from empirical data (see [25]). It is surprising that a form of stability turns out to play such a key role in learning theory. It is somewhat intuitive that stable solutions are predictive but it is surprising that our specific definition of $CV_{loo}$ stability fully subsumes the classical necessary and sufficient conditions on H for consistency of ERM.

$CVEEE_{loo}$ (or LOO) stability and its properties may suggest how to develop learning theory beyond the ERM approach. It is a simple observation that $CVEEE_{loo}$ (or LOO) stability can provide generalization bounds for algorithms other than ERM. For some of them a "VC-style" analysis in terms of complexity of the hypothesis space can still be used; for others, such as k-Nearest Neighbor, such an analysis is impossible because the hypothesis space has unbounded complexity or is not even defined, whereas $CV_{loo}$ stability can still be used.

^{29} Obtained using either standard covering number bounds or theorem 3.1.

^{30} Obtained using either standard covering number bounds or strong stability.

• (B): Well-posedness and, in particular, stability are at the core of the study of inverse problems and of the techniques for solving them. The notion of $CV_{loo}$ stability may be a tool to bridge learning theory and the broad research area of the study of inverse problems in applied math and engineering (for a review see [10]). As we mentioned in the introduction, while predictivity is at the core of classical learning theory, another motivation drove the development of several of the best existing algorithms (such as regularization algorithms of which SVMs are a special case): well-posedness and, in particular, stability of the solution. These two requirements - consistency and stability - have been treated so far as "de facto" separate and in fact there was no a priori reason to believe that they are related (see [20]). Our new result shows that these two apparently different motivations are closely related and actually completely equivalent for ERM.

Some  additional  remarks  and  open  questions  are: 

1. It would be interesting to analyze $CVEEE_{loo}$ and LOO stability properties - and thereby estimate bounds on rate of generalization - of several non-ERM algorithms. Several observations can be already inferred from existing results. For instance, the results of [3] imply that regularization and SVMs are $CVEEE_{loo}$ (and also LOO) stable; a version of bagging with the number k of regressors increasing with n (with $k/n \to 0$) is $CV_{loo}$ stable and has hypothesis stability (because of [7]) and $EE_{loo}$ stability and is thus $CVEEE_{loo}$ (and LOO) stable; similarly k-NN with $k \to \infty$ and $k/n \to 0$ and kernel rules with the width $h_n \to 0$ and $h_n n \to \infty$ are $CVEEE_{loo}$ (and LOO) stable. Thus all these algorithms satisfy Theorem 3.1 and Proposition 3.1 and have the generalization property, that is $I_S[f_S]$ converges to $I[f_S]$ (and some are also universally consistent).

2. The rate of convergence of the empirical error to the expected error for the empirical minimizer for certain hypothesis spaces differs, depending on whether we use the stability approaches or measures of the complexity of the hypothesis space, for example VC dimension or covering numbers. This discrepancy is illustrated by the following two gaps.


(a) The hypothesis spaces in section 4 that have a fast rate of error stability have a rate of convergence of the empirical error of the minimizer to the expected error of $O(d/n)$, where d is the VC dimension or metric entropy. This rate is obtained using VC-type bounds. The strong stability approach, which uses a variation of McDiarmid's inequality, gives a rate of convergence of $O(n^{-1/2})$. It may be possible to improve these rates using inequalities of the type in [19].

(b) For the hypothesis spaces described at the end of section 4 standard martingale inequalities cannot be used to prove convergence of the empirical error to the expected error for the empirical minimizer.


It is known that martingale inequalities do not seem to yield results of optimal order in many situations (see [22]). A basic problem in the martingale inequalities is how variance is controlled. Given a random variable $Z = f(X_1, \dots, X_n)$ the variance of this random variable is controlled by a term of the form


Var(Z)  <  IE 


where $Z^{(i)} = f(X_1, \dots, X_i', \dots, X_n)$. If we set $Z = I_S[f_S] - I[f_S]$ then for a function class with VC dimension d the upper bound on the variance is a constant since

IE[(Z-ZW)2]  =  k^. 

However,  for  this  class  of  functions  we  know  that 


V^r{Is[fs]-I[fs])=e(^\[^y 

It  is  an  open  question  if  some  other  concentration  inequality  can  be  used 
to  recover  optimal  rates. 

3. We have a direct proof of the following statement for ERM: If $\mathcal{H}$ has infinite VC dimension, then for all n, $(\beta_{PH})_n$ exceeds a fixed positive constant. This shows that distribution-free $\beta_{PH}$ does not converge to zero if $\mathcal{H}$ has infinite VC dimension and therefore provides a direct link between VC and $CV_{loo}$ stability (instead of via consistency).

4. Our results say that for ERM, distribution-independent $CV_{loo}$ stability is equivalent to the uGC property of $\mathcal{L}$. What can we say about compactness? Compactness is a stronger constraint on $\mathcal{L}$ than uGC (since compact spaces are uGC but not vice versa). Notice that the compactness case is fundamentally different because a compact $\mathcal{H}$ is a metric space, whereas in our main theorem we work with spaces irrespectively of their topology. The specific question we ask is whether there exists a stability condition that is related to compactness - as $CV_{loo}$ stability is related to the uGC property. Bousquet and Elisseeff showed that Tikhonov regularization (which enforces compactness but is NOT empirical risk minimization) gives uniform stability (with fast rate). Kutin and Niyogi showed that Bousquet and Elisseeff's uniform stability is unreasonably strong for ERM and introduced the weaker notion of $(\beta, \delta)$-hypothesis stability in equation 14. It should also be noted (observation by Steve Smale) that both these definitions of stability effectively require a hypothesis space with the sup norm topology. The following theorems illustrate some relations. For these theorems we assume that the hypothesis space is a bounded subset of $C(X)$ where X is a closed, compact subset $X \subset \mathbb{R}^d$ and Y is a closed subset $Y \subset \mathbb{R}$.

Theorem 5.1 Given $(\beta, \delta)$-hypothesis stability for ERM with the square loss, the hypothesis space H is compact.

Theorem 5.2 If H is compact and convex then ERM with the square loss is $(\beta, \delta)$-hypothesis stable under regularity conditions of the underlying measure.

The proof is sketched in the Appendix 6.3. The theorems are not symmetric, since the second requires convexity and constraints on the measure. Thus they do not answer in a satisfactory way the question we posed about compactness and stability. In fact it can be argued on general grounds that compactness is not an appropriate property to consider in connection with hypothesis stability (Mendelson, pers. com.).

Finally, the search for "simpler" conditions than $CVEEE_{loo}$ stability is open. Either $CVEEE_{loo}$ or LOO stability answers all the requirements we need: each one is sufficient for generalization in the general setting and subsumes the classical theory for ERM, since it is equivalent to consistency of ERM. It is quite possible, however, that $CVEEE_{loo}$ stability may be equivalent to other, even "simpler" conditions. In particular, we conjecture that $CV_{loo}$ and $EE_{loo}$ stability are sufficient for generalization for general algorithms (without $E_{loo}$ stability). Alternatively, it may be possible to combine $CV_{loo}$ stability with a "strong" condition such as hypothesis stability. We know that hypothesis stability implies $Eloo_{err}$ stability; we do not know whether or not ERM on a uGC class implies hypothesis stability, though we conjecture that it does.

The diagram of Figure 1 shows relations between the various properties discussed in this paper. The diagram is by itself a map to a few open questions.

6 Appendices

6.1  Some  conditions  that  are  necessary  and  sufficient  for  the 
uGC  property 

Alon et al. proved a necessary and sufficient condition for universal (wrt all distributions) and uniform (over all functions in the class) convergence of $|I_S[f] - I[f]|$, in terms of the finiteness for all $\gamma > 0$ of a combinatorial quantity called the $V_\gamma$ dimension of $\mathcal{F}$ (which is the set $V(f(x), y)$, $f \in \mathcal{H}$), under some assumptions (such as convexity, continuity, Lipschitz) on $V$.

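Alon's result is based on a necessary and sufficient (distribution independent) condition proved by Vapnik and Dudley et al. which uses the (distribution-independent) metric entropy of $\mathcal{F}$ defined as

$$H_n(\epsilon, \mathcal{F}) = \sup_{x_n \in X^n} \log \mathcal{N}(\epsilon, \mathcal{F}, x_n),$$

where $\mathcal{N}(\epsilon, \mathcal{F}, x_n)$ is the $\epsilon$-covering of $\mathcal{F}$ with respect to $l^\infty_{x_n}$ ($l^\infty_{x_n}$ is the $l^\infty$ distance on the points $x_n$).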

Theorem (Dudley et al.) $\mathcal{F}$ is a strong uniform Glivenko-Cantelli class iff $\lim_{n\to\infty} H_n(\epsilon, \mathcal{F})/n = 0$ for all $\epsilon > 0$.

Notice (see [16]) that the metric entropy may be defined (and used in the above theorem) with respect to empirical norms other than $l^\infty_{x_n}$.

Thus the following equivalences hold:

$$\mathcal{H} \text{ is uGC} \iff \lim_{n\to\infty} \frac{H_n(\epsilon, \mathcal{F})}{n} = 0 \iff \text{finite } V_\gamma \ \ \forall \gamma > 0.$$

Finite VC dimension is the special case of the latter condition when the functions in $\mathcal{H}$ are binary. It is well known that a necessary and sufficient condition for uniform convergence in the case of 0-1 functions is finiteness of the VC dimension (Vapnik). Notice that the $V_\gamma$ dimension exactly reduces to the VC dimension for $\gamma = 0$.


6.2  Strong  stability  implies  good  convergence  rates 

Proof  of  Theorem  4.1 

This  theorem  is  a  variation,  using  our  definition  of  L-stability,  of  Theorem  6.17 
in  [14].  We  first  give  two  definitions  from  [14]. 

Definition 6.1 Change-one cross-validation (CVco) stability is defined as

$$\mathbb{P}_{S,z}\{|V(f_S, z) - V(f_{S^{i,z}}, z)| \le \beta_{CV}\} \ge 1 - \delta_{CV},$$

where $\beta_{CV} = O(n^{-\alpha})$ and $\delta_{CV} = O(n^{-\tau})$ for $\tau, \alpha > 0$.

Definition 6.2 Error stability is defined as

$$\mathbb{P}_{S,z}\{|I[f_S] - I[f_{S^{i,z}}]| \le \beta_{Error}\} \ge 1 - \delta_{Error},$$

where $\beta_{Error} = O(n^{-\alpha})$ with $\alpha > 1/2$ and $\delta_{Error} = \dots$


Notice that both these definitions perturb the data by replacement. In our definition of $CV_{loo}$ stability we use the leave-one-out procedure to perturb the training set.

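Kutin and Niyogi 2002, Theorem 6.17: If the learning map has CVco stability $(\beta_{CV}, \delta_{CV})$ and error stability $(\beta_{Error}, \delta_{Error})$ (where error stability is defined with respect to a point being replaced), then, for any $\epsilon > 0$, the probability

$$\mathbb{P}_S\{|I_S[f_S] - I[f_S]| > \epsilon + \beta_{CV} + M\delta_{CV}\}$$

is bounded above by an expression in $\epsilon$, $n$, $M$, $\beta_{Error}$ and $\delta_{Error}$ (see [14] for its explicit form), where $M$ is a bound on the loss.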


The proof of Theorem 6.17 requires two separate steps: first bounding the mean and, second, bounding the variance of the generalization error. The variance is bounded by using error stability; the mean is bounded by using CV stability. Then McDiarmid's inequality is used.

Next we will first show (in (a)) that our definition of error stability, with the leave-one-out perturbation, and Kutin and Niyogi's definition differ by at most a factor 2 (this relates $\beta_{Error}$ and $\delta_{Error}$ to $\beta'_{Error}$ and $\delta'_{Error}$). Then (in (b)) we will directly bound the mean of the generalization error.

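(a) If we have with probability $1 - \delta'_{Error}$

$$|I[f_S] - I[f_{S^i}]| \le \beta'_{Error},$$

then with probability $1 - 2\delta'_{Error}$

$$|I[f_S] - I[f_{S^{i,z}}]| \le 2\beta'_{Error}.$$

This holds because of the following inequalities:

$$|I[f_S] - I[f_{S^{i,z}}]| = |I[f_S] - I[f_{S^i}] + I[f_{S^i}] - I[f_{S^{i,z}}]| \le |I[f_S] - I[f_{S^i}]| + |I[f_{S^i}] - I[f_{S^{i,z}}]|.$$

Each term on the right compares a training set of size $n$ ($S$ or $S^{i,z}$) with the same set after its $i$-th point is left out, so each is at most $\beta'_{Error}$ with probability $1 - \delta'_{Error}$, and the union bound gives the claim.

This allows us to replace our definition of error stability with that used in [14].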

(b) In the proof of Theorem 6.17 in [14] the terms $\beta_{CV}$ and $\delta_{CV}$ are used to bound the following quantity:

$$\mathbb{E}_{S,z}[V(f_S, z) - V(f_{S^{i,z}}, z)] \le \beta_{CV} + M\delta_{CV},$$

where $M$ is the upper bound on the error at a point. We need a similar upper bound using $CV_{loo}$ stability instead of CVco stability. We rewrite the left-hand side of the above inequality as

$$\mathbb{E}_{S,z}[V(f_S, z) - V(f_{S^{i,z}}, z)] = \mathbb{E}_S[I[f_S] - I_S[f_S]], \qquad (31)$$

which is the expectation with respect to $S$ of the generalization error of $f_S$. The following inequality holds:

$$\mathbb{E}_S[I[f_S] - I_S[f_S]] = \mathbb{E}_S[I[f_{S^i}] - I[f_{S^i}] + I[f_S] - I_S[f_S]]$$

$$|\mathbb{E}_S[I[f_S] - I_S[f_S]]| \le |\mathbb{E}_S[I[f_{S^i}] - I_S[f_S]]| + |\mathbb{E}_S[I[f_S] - I[f_{S^i}]]|$$

$$\le |\mathbb{E}_S[I[f_{S^i}] - I_S[f_S]]| + \beta'_{Error} + M\delta'_{Error},$$

the last step following from error stability.

From $CV_{loo}$ stability the following inequality holds:

$$|\mathbb{E}_S[I[f_{S^i}] - I_S[f_S]]| \le \mathbb{E}_S|V(f_{S^i}, z_i) - V(f_S, z_i)| \le \beta_{loo} + M\delta_{loo}.$$

We have now done steps (a) and (b). We restate the KN theorem in terms of our definitions of $CV_{loo}$ stability and error stability:

L-stability implies that the bound in Equation 4.1 holds for all $\epsilon, \delta > 0$. Then for any $f_S$ obtained by ERM from a dataset $S$ and for any $\epsilon > 0$

$$\lim_{n\to\infty} \sup_{\mu} \mathbb{P}_S\{|I_S[f_S] - I[f_S]| > \epsilon\} = 0. \quad \Box$$

6.3  Compactness  and  stability 

Convexity 

In the proofs used here, the convexity of $\mathcal{H}$ and $V$ plays a key role. Given any two functions $f_1$ and $f_2$ in a convex $\mathcal{H}$, their average $f_A(x) = \frac{1}{2}(f_1(x) + f_2(x))$ is also in $\mathcal{H}$. Furthermore, if $V$ is convex, for any $z \in Z$,

$$\ell(f_A(z)) \le \frac{1}{2}\left(\ell(f_1(z)) + \ell(f_2(z))\right).$$

When $V$ is the square loss, and $M$ is a bound on $(f(x) - y)^2$, we have the following lemma.

Lemma 6.1

• If $|\ell(f_1(z)) - \ell(f_2(z))| \ge d_\ell$, then $|f_1(x) - f_2(x)| \ge \sqrt{M} - \sqrt{M - d_\ell}$.

• If $|f_1(x) - f_2(x)| = d_f$, then $\ell(f_A(z)) + \frac{d_f^2}{4} = \frac{1}{2}\left(\ell(f_1(z)) + \ell(f_2(z))\right)$.

• If $|\ell(f_1(z)) - \ell(f_2(z))| \ge d_\ell$, then $\ell(f_A(z)) + \frac{(\sqrt{M} - \sqrt{M - d_\ell})^2}{4} \le \frac{1}{2}\left(\ell(f_1(z)) + \ell(f_2(z))\right)$.

The first part is obvious from inspection of the loss function. The second part is a simple algebraic exercise. The third part is simply a combination of the first and second parts.
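The first two parts of the lemma can be checked numerically. The sketch below is only an illustration, under the assumption that predictions and labels lie in $[-1, 1]$, so that $M = 4$ bounds $(f(x) - y)^2$; the function names are hypothetical and not from the paper.

```python
import math
import random

def square_loss(prediction, y):
    """Square loss (f(x) - y)^2 at a single point."""
    return (prediction - y) ** 2

def midpoint_gap(f1, f2, y):
    """0.5*(loss(f1) + loss(f2)) - loss((f1 + f2)/2); part 2 says this is (f1 - f2)^2 / 4."""
    f_avg = 0.5 * (f1 + f2)
    return 0.5 * (square_loss(f1, y) + square_loss(f2, y)) - square_loss(f_avg, y)

def min_prediction_gap(M, d_loss):
    """Part 1 lower bound on |f1 - f2| when the two losses differ by d_loss."""
    return math.sqrt(M) - math.sqrt(max(0.0, M - d_loss))

def check_lemma(trials=1000, seed=0, M=4.0, tol=1e-9):
    """Randomly test parts 1 and 2 for values in [-1, 1] (assumed range, so M = 4)."""
    rng = random.Random(seed)
    for _ in range(trials):
        f1, f2, y = (rng.uniform(-1.0, 1.0) for _ in range(3))
        d_f = abs(f1 - f2)
        d_loss = abs(square_loss(f1, y) - square_loss(f2, y))
        if abs(midpoint_gap(f1, f2, y) - d_f ** 2 / 4.0) > tol:
            return False  # part 2 would be violated
        if d_f < min_prediction_gap(M, d_loss) - tol:
            return False  # part 1 would be violated
    return True
```

For example, with $f_1 = 1$, $f_2 = 0$ and $y = 0.3$ the gap is $0.29 - 0.04 = 0.25 = d_f^2/4$.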

Proof 

• Remark. Notice that under our hypotheses the minimizer of the expected risk exists and is unique when $\mathcal{H}$ is convex and appropriate regularity assumptions on the measure hold [5] (see also [17] p. 5). The minimizer of the empirical risk is not unique in general, but is unique with respect to the empirical measure over the training set.

We first sketch a proof of sufficiency, that is, compactness and convexity of $\mathcal{H}$ imply $(\beta, \delta)$-hypothesis stability:

Proof:

Since $\mathcal{H}$ is compact and convex, given appropriate regularity assumptions on the measure there is a unique function $f^*$ that minimizes the expected risk [5]. We first compute the probability that $f_S$ and $f^*$ are close in expected risk:

$$|I[f_S] - I[f^*]| \le \epsilon$$
$$|I[f_{S^i}] - I[f^*]| \le \epsilon.$$

Lemma 6.2 With probability $1 - \delta$

$$|I[f_S] - I[f^*]| \le \epsilon,$$

where $\epsilon = \dots$ and $\delta = \mathcal{N}(\epsilon, \mathcal{L})\,\dots$, with $0 < \tau < 1/2$.

Proof:

The function $f^*$ has error rate $I[f^*] = \eta$. We define the set of functions $f_g$ as follows:

$$f_g = \{f \text{ for which } |I[f] - I[f^*]| \le \epsilon\}.$$

By Chernoff's inequality, for a function $f_g$

$$\mathbb{P}_S(I_S[f_g] > \eta + \epsilon/2) \le e^{-\dots}$$

We define the set of functions $f_b$ as

$$f_b = \{f \text{ for which } |I[f] - I[f^*]| \ge \epsilon\}.$$

Also by Chernoff's inequality, an analogous bound holds for a function $f_b$.

Thus for $f_S$ to not be $\epsilon$-close to $f^*$, at least one of the following events must happen:

$$\min_{f \in f_b} I_S[f] \le \eta + \epsilon$$

$$\max_{f \in f_g} I_S[f] \ge \eta + \epsilon$$

The following uniform convergence bound then holds:

$$\mathbb{P}_S(|I[f_S] - I[f^*]| \ge \epsilon) \le \mathcal{N}(\epsilon, \mathcal{L})\,\dots$$

where $\mathcal{N}(\epsilon, \mathcal{L})$ is the covering number. Setting $\epsilon = \dots$ where $0 < \tau < 1/2$ gives the result stated. □

The above lemma implies that with probability $1 - \delta$ the following inequality holds:

$$|I[f_S] - I[f^*]| \le \epsilon,$$

with $\epsilon$ and $\delta$ as given in Lemma 6.2.

We now show (by way of the contrapositive) that under the assumptions of compactness and convexity, a function that is close to the optimal function in generalization error is also close in the sup norm:

$$\sup_x |f_S(x) - f^*(x)| \ge 2\epsilon \ \Rightarrow\ |I[f_S] - I[f^*]| \ge \frac{c\epsilon^2}{2}$$

for some constant $c$.

We will use Arzelà's theorem [13].

Theorem (Arzelà and Ascoli): A necessary and sufficient condition for a family of continuous functions $f \in \mathcal{F}$ defined on a compact set $X$ to be relatively compact in $C(X)$ is that $\mathcal{F}$ be uniformly bounded and equicontinuous.

Given that

$$\sup_x |f_S(x) - f^*(x)| \ge 2\epsilon,$$

we define the following set:

$$X^\epsilon = \{x : |f_S(x) - f^*(x)| \ge \epsilon\}.$$

By equicontinuity we know that $\mu(X^\epsilon) \ge c$, where the equicontinuity allows us to lower bound the measure of the domain over which the functions differ in the supremum. This constant $c$ will exist for sufficiently regular measures; when it exists, its value will depend on the measure.

If we define a function $f_A = \frac{1}{2}(f_S + f^*)$, the following holds:

$$\forall x \in X: \quad I_x[f_A] \le \frac{1}{2}\left(I_x[f_S] + I_x[f^*]\right)$$

$$\forall x \in X^\epsilon: \quad I_x[f_A] + \frac{\epsilon^2}{4} \le \frac{1}{2}\left(I_x[f_S] + I_x[f^*]\right)$$

where $I_x[f] = \int_Y V(f(x), y)\, d\mu(y)$ is the expected risk of $f$ at the point $x$. Combining the above inequalities over all points $x \in X$, we see that

$$I[f_A] + \frac{c\epsilon^2}{4} \le \frac{1}{2}\left(I[f_S] + I[f^*]\right).$$

But $f^*$ is the minimizer of $I[f]$, so we also have that

$$I[f^*] + \frac{c\epsilon^2}{4} \le \frac{1}{2}\left(I[f_S] + I[f^*]\right).$$

From the above we have that

$$I[f^*] + \frac{c\epsilon^2}{4} \le \frac{1}{2}\left(I[f_S] + I[f^*]\right)$$

$$\Rightarrow\ I[f^*] + \frac{c\epsilon^2}{2} \le I[f_S]$$

$$\Rightarrow\ |I[f_S] - I[f^*]| \ge \frac{c\epsilon^2}{2}.$$

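From this we can conclude that if the difference in the loss of a function from the optimal function is bounded then so is the difference in the sup norm. Since we know that with probability $1 - \delta$,

$$|I[f_S] - I[f^*]| \le \beta,$$

where $\beta$ is the bound of Lemma 6.2, then (taking $\epsilon = \sqrt{2\beta/c}$ in the implication above) also with probability $1 - \delta$,

$$\sup_x |f_S(x) - f^*(x)| \le 2\sqrt{\frac{2\beta}{c}}$$

and, applying the same argument to $f_{S^i}$ (we use the uGC constants associated with a training set of size $n - 1$ for both $S$ and $S^i$), with probability $1 - 2\delta$,

$$\sup_x |f_S(x) - f_{S^i}(x)| \le 4\sqrt{\frac{2\beta}{c}}.$$

Since $f \in \mathcal{H}$ is bounded and so is $Y$, the square loss has a Lipschitz property; then

$$\sup_x |f_S(x) - f_{S^i}(x)| \le 4\sqrt{\frac{2\beta}{c}}$$

implies

$$\sup_z |V(f_S, z) - V(f_{S^i}, z)| \le 4K\sqrt{\frac{2\beta}{c}},$$

where $K$ denotes a Lipschitz constant of the square loss on the bounded range. This is $(\beta, \delta)$-hypothesis stability. □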

• Remark. The proof presented here obviously relies crucially on convexity. It is possible to replace convexity with other assumptions but it seems that compactness alone will not suffice. For example, consider the simple case where $X$ consists of a single point $x$, $\mathcal{H}$ consists of the two functions $f(x) = 1$ and $f(x) = -1$, and $y(x)$ takes on the values 1 and -1, each with probability 1/2. The minimizer of the true risk is non-unique here: both functions have identical true risk. However, a simple combinatorial argument shows $f_S$ and $f_{S^i}$ will only be different functions with probability $O(1/\sqrt{n})$, so we still have $(\beta, \delta)$-stability. A possible replacement for the convexity assumption is to relax our definition of $(\beta, \delta)$-hypothesis stability so that it may not hold on sets of measure zero. Another possibility seems to be the assumption that the target function is continuous.
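The combinatorial argument in the two-function example can be made concrete. With the square loss, ERM over $\{f = 1, f = -1\}$ returns the majority label of the sample, so removing one point can change the returned function only when the counts of $+1$ and $-1$ labels differ by at most one. The sketch below computes that near-tie probability exactly for fair labels; the function name is hypothetical and the tie-breaking rule is left unspecified (the bound holds for any rule).

```python
from math import comb, sqrt

def near_tie_probability(n):
    """P(|#(+1) - #(-1)| <= 1) for n independent fair +/-1 labels.

    This upper-bounds the probability that majority-vote ERM changes
    its output when a single training point is left out.
    """
    favourable = 0
    for k in range(n + 1):  # k = number of +1 labels
        if abs(2 * k - n) <= 1:
            favourable += comb(n, k)
    # Integer true division stays accurate even for very large n.
    return favourable / 2 ** n

def scaled_near_tie(n):
    """near_tie_probability(n) * sqrt(n); stays bounded if the rate is O(1/sqrt(n))."""
    return near_tie_probability(n) * sqrt(n)
```

Evaluating `scaled_near_tie` for growing `n` gives values that remain bounded, which is consistent with the $O(1/\sqrt{n})$ rate stated in the remark.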

We now prove necessity, that is, $(\beta, \delta)$-hypothesis stability implies compactness of $\mathcal{H}$.

Proof:

Since $(\beta, \delta)$-hypothesis stability implies $CV_{loo}$ stability, $\mathcal{H}$ is a uGC class because of Theorem 3.2. Suppose that $(\mathcal{H}, L_\infty)$ is not compact. Then, by Arzelà's theorem, $\mathcal{H}$ is not equicontinuous. In particular, there exists an $\epsilon_{\mathcal{H}} > 0$, a sequence $\delta_i \to 0$ and a sequence of functions $f_i \in \mathcal{H}$ satisfying

$$\forall i,\ \exists x, x' \in X \text{ s.t. } |x - x'| \le \delta_i,\ |f_i(x) - f_i(x')| \ge \epsilon_{\mathcal{H}}. \qquad (32)$$

We first note that each individual $f_i$, being continuous over a compact domain, is uniformly continuous, and using this fact, it is easy to show that given $f_i$, there exists a $k$ such that $\forall i' > k$, $|f_i - f_{i'}| \ge \dots$

Now, consider $(\mathcal{H}, L_2)$, the set of functions $\mathcal{H}$ equipped with the $L_2$ norm. This set is totally bounded, because a uGC class is totally bounded in $L_2$ (Shahar Mendelson, personal communication). Note that $(\mathcal{H}, L_2)$ is not (in general) compact, because it is (in general) incomplete. Therefore, any infinite sequence in $(\mathcal{H}, L_2)$ contains a fundamental (Cauchy) subsequence. Using this property, define the sequence $g_i$ to be a subsequence of the $f_i$ referred to in Equation 32 which is Cauchy in $(\mathcal{H}, L_2)$.

The sequence $g_i$ converges in measure in the $L_2$ metric to some (possibly discontinuous) function $f^*$, which in turn implies that $g_i$ contains a subsequence $h_i$ which converges almost everywhere to $f^*$ ([13] p. 292, Problem 9).

Next, construct a distribution $D$ that is uniform over the input space $X$, with "target" $y(x) = f^*(x)$. Consider any finite sample $S$ from $D$. For $i$ sufficiently large, all $h_{i'}$ for $i' \ge i$ will be $\epsilon$-empirical risk minimizers, and therefore any of them may be returned as either $f_S$ or $f_{S^i}$.

Consider one such $h_i$. Let $X^\epsilon_i$ be those $x$ satisfying $|f^*(x) - h_i(x)| > \dots$ Suppose $\mu(X^\epsilon_i) > 0$. Because the $h_i$ are converging almost everywhere to $f^*$, there will exist some $i'$ and some $x$ in this set for which $|f^*(x) - h_{i'}(x)| < \dots$; at this point the losses of $h_i$ and $h_{i'}$ will differ by at least $\dots$, showing we do not have $(\beta, \delta)$-hypothesis stability with $\beta < \dots$ and $\delta \to 0$. The only other possibility is that $\mu(X^\epsilon_i) = 0$, and $|f^*(x) - h_i(x)| \le \dots$ almost everywhere. In this case, because each $h_i$ is continuous over a compact domain and therefore uniformly continuous, find $\delta$ such that

$$|x - x'| \le \delta \ \Rightarrow\ |h_i(x) - h_i(x')| \le \dots$$

which in turn implies that

$$|x - x'| \le \delta,\ x, x' \in X - X^\epsilon_i \ \Rightarrow\ |f^*(x) - f^*(x')| \le \dots$$

Choose an $h_{i'}$ such that

$$\exists x, x' \text{ s.t. } |x - x'| \le \delta,\ |h_{i'}(x) - h_{i'}(x')| \ge \epsilon_{\mathcal{H}}.$$

Because $h_{i'}$ is uniformly continuous, there will exist sets $X^+$ and $X^-$, both of positive measure, such that for any $x \in X^+$ and any $x' \in X^-$,

$$|x - x'| \le \delta,\ |h_{i'}(x) - h_{i'}(x')| \ge \dots$$

Since these sets both have positive measure, there must exist a pair $x$ and $x'$ in $X - X^\epsilon_i$ where all of the following hold:

• $|f^*(x) - h_i(x)| \le \dots$

• $|f^*(x') - h_i(x')| \le \dots$

• $|f^*(x) - f^*(x')| \le \dots$

• $|h_{i'}(x) - h_{i'}(x')| \ge \dots$

Then at at least one of the two points $x$ and $x'$ referenced in the above (say $x$), we have $|f^*(x) - h_i(x)| \le \dots$ but $|f^*(x) - h_{i'}(x)| \ge \dots$, again showing that we do not have $(\beta, \delta)$-hypothesis stability.

Assuming the combination of $(\beta, \delta)$-hypothesis stability, the uGC property, and non-compactness led to a contradiction; we therefore conclude that $(\beta, \delta)$-hypothesis stability combined with uGC implies compactness. □

• Remark. Although $(\beta, \delta)$-hypothesis stability is stated as a condition on the relationship between $f_S$ and $f_{S^i}$, under the conditions discussed in this paper (namely convexity), we find that the key condition is instead uniqueness of the ERM function $f_S$ for large datasets. In particular, it is possible to show that if $\mathcal{H}$ is not compact, then we can construct a situation where there will be functions with risk arbitrarily close to $f^*$ that are far from $f^*$ in the sup norm over a set with positive measure. This leads to a situation where $f_S$ and $f_{S^i}$ are both sets of functions containing elements that differ in the sup norm; since any of these functions can be picked as either $f_S$ or $f_{S^i}$, we will not have $(\beta, \delta)$-hypothesis stability.

In contrast, if we additionally assume compactness and convexity, this cannot happen: a function cannot simultaneously be far from $f^*$ in the sup norm and arbitrarily close to $f^*$ in risk.

The Realizable Setting

We say that the setting is realizable when there is some $f_0 \in \mathcal{H}$ which is consistent with the examples. In this case there is a simpler proof of the sufficient part of the conjecture.

Theorem When the setting is realizable, compactness of $\mathcal{H}$ implies $(\beta, \delta)$-hypothesis stability.

Proof 

Consider the family of functions $\mathcal{L}$ consisting of $\ell(z) = (f(x) - y)^2$. If $\mathcal{H}$ is bounded and compact then $\mathcal{L} \subset C(Z)$ is a compact set of continuous bounded functions with the norm $\|\ell\| = \sup_{z \in Z} |\ell(z)|$. Let $S = (x, y)^n$ with $z = \{z_1, \ldots, z_n\}$ a set of points in $Z$. Take a covering of $Z$ defined in terms of a set of $n$ disks $D_i(z, \nu(z))$, where $\nu(z)$ is the smallest radius sufficient for the union of the $n$ disks to cover $Z$. Taking a hint from a proof by Pontil (manuscript in preparation) we write the norm in $C(Z)$ in terms of the empirical norm w.r.t. the set $z$ for $\ell \in \mathcal{L}$:

$$\sup_{z \in Z} |\ell(z)| = \max_{i=1,\ldots,n} \Big\{ \sup_{z \in D_i(z, \nu(z))} |\ell(z)| \Big\} \qquad (33)$$

The above equality implies

$$\sup_{z \in Z} |\ell(f_S(z)) - \ell(f_{S^i}(z))| = \max_{i=1,\ldots,n} \Big\{ \sup_{z \in D_i(z, \nu(z))} |\ell(f_S(z)) - \ell(f_{S^i}(z))| \Big\} \qquad (34)$$

which can be rewritten as

$$\sup_{z \in Z} |\ell(f_S(z)) - \ell(f_{S^i}(z))| = \max_{i=1,\ldots,n} \Big\{ \sup_{z \in D_i(z, \nu(z))} \big|\ell(f_S(z)) - \ell(f_S(z_i)) + \ell(f_S(z_i)) - \big(\ell(f_{S^i}(z)) - \ell(f_{S^i}(z_i))\big) - \ell(f_{S^i}(z_i))\big| \Big\},$$

leading to the following inequality

$$\sup_{z \in Z} |\ell(f_S(z)) - \ell(f_{S^i}(z))| \le \max_{i=1,\ldots,n} |\ell(f_S(z_i)) - \ell(f_{S^i}(z_i))| + O(\epsilon(\nu(S))), \qquad (35)$$

in which we bound separately the variation of $\ell(f_S)$ and $\ell(f_{S^i})$ within each $D_i$ using the equicontinuity of $\mathcal{L}$. We assume that under regularity conditions (see argument below) on the measure, the radius of the disks $\nu$ goes to 0 as $n \to \infty$. Under the realizable setting assumptions all but one of the $n$ terms in the max operation on the right-hand side of the inequality above disappear (since $\ell(f_S(x)) = 0$ when $x \in S$). $(\beta, \delta)$-LOO stability (which follows from our main theorem in the paper since $\mathcal{L}$ is uGC because it is compact) can now be used to bound $|\ell(f_S(z_i)) - \ell(f_{S^i}(z_i))|$ for the only non-zero term (the one corresponding to $j = i$, i.e. the disk centered on the point which is in $S$ but not $S^i$). Thus

$$\mathbb{P}_S\Big\{ \sup_{z} |\ell(f_S(z)) - \ell(f_{S^i}(z))| \le \beta_{LOO} + O(\epsilon(\nu(S))) \Big\} \ge 1 - \delta \qquad (36)$$

with $\beta_{LOO}$, $\epsilon(\nu(S))$ and $\delta$ all going to 0 as $n \to \infty$. This is $(\beta, \delta)$-hypothesis stability. Since $\mathcal{L}$ is compact, it is also uGC and thus fast error stability holds.

To show $\nu(S) \to 0$, consider an arbitrary radius for the disks $a > 0$. Call $\delta_i = P(D_i, a)$ the probability of a point in $Z$ to be covered by the disk $D_i$ when the points $x_i$ for $i = 1, \ldots, n$ are sampled (in general the $\delta_i$ are different from each other because the measure $\mu$ is in general not uniform in $X$). Consider $\delta = \min_i \delta_i$. Notice that $\delta = O(a^k)$, where $k$ is the dimensionality of $Z$. Thus an arbitrary point $x \in X$ will be covered by at least one disk with high probability, $p = 1 - (1 - \delta)^n$. It is easy to see that for $n \to \infty$, the probability of a cover $p \to 1$, while simultaneously the disk shrinks ($a \to 0$) slowly enough, with a rate $n^{-\alpha}$ where $\alpha > 0$, $\alpha k < 1$ (because $\lim_{n \to \infty} (1 - n^{-\alpha k})^n = 0$ if $\alpha k < 1$).
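The role of the condition $\alpha k < 1$ can be seen numerically. The sketch below evaluates $p = 1 - (1 - \delta)^n$ with $\delta = c\,n^{-\alpha k}$; the proportionality constant `c` and the function name are assumptions made only for illustration, since the text gives $\delta = O(a^k)$ without a constant.

```python
def cover_probability(n, alpha, k, c=1.0):
    """Coverage probability 1 - (1 - delta)^n for disks of radius n^(-alpha).

    delta = c * n^(-alpha * k) is the assumed per-disk probability mass,
    clipped to 1 so that it remains a valid probability.
    """
    delta = min(1.0, c * n ** (-alpha * k))
    return 1.0 - (1.0 - delta) ** n

def coverage_curve(alpha, k, sizes=(10, 100, 1_000, 10_000, 100_000, 1_000_000)):
    """Coverage probabilities for increasing sample sizes."""
    return [cover_probability(n, alpha, k) for n in sizes]
```

With `alpha * k = 0.5` the values returned by `coverage_curve(0.25, 2)` approach 1, while with `alpha * k = 2` (disks shrinking too fast) `coverage_curve(1.0, 2)` decays towards 0.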

• Remark. Using the assumption of convexity, this proof can be extended to the non-realizable (general) case. Suppose that the difference at the left out training point $z_i$ is less than or equal to $\beta$:

$$\ell(f_{S^i}(z_i)) - \ell(f_S(z_i)) \le \beta.$$

(We don't need an absolute value here, because the loss of $f_S$ at the left out point is always less than the loss of $f_{S^i}$ at that point.) Then we will also be able to bound the difference in loss at all the other training points. Suppose we have a training point $z_j$ for which

$$|\ell(f_{S^i}(z_j)) - \ell(f_S(z_j))| \ge b.$$

Then, by our convexity lemma, at $z_j$, the loss of the average function $f_A = \frac{1}{2}(f_S + f_{S^i})$ satisfies

$$\ell(f_A(z_j)) + c(b) \le \frac{1}{2}\left(\ell(f_S(z_j)) + \ell(f_{S^i}(z_j))\right),$$

where $c(b) = \frac{(\sqrt{M} - \sqrt{M - b})^2}{4}$. Summing over the set $S^i$, we have

$$I_{S^i}[f_A] + \frac{c(b)}{n-1} \le \frac{1}{2}\left(I_{S^i}[f_S] + I_{S^i}[f_{S^i}]\right)$$

$$\Rightarrow\ I_{S^i}[f_{S^i}] + \frac{2c(b)}{n-1} \le I_{S^i}[f_S],$$

using the fact that $f_{S^i}$ is the minimizer of $I_{S^i}[\cdot]$. But using the fact that $f_S$ minimizes $I_S[\cdot]$,

$$I_S[f_S] \le I_S[f_{S^i}]$$

$$\frac{n-1}{n} I_{S^i}[f_S] + \frac{1}{n} V(f_S(x_i), y_i) \le \frac{n-1}{n} I_{S^i}[f_{S^i}] + \frac{1}{n} V(f_{S^i}(x_i), y_i)$$

$$I_{S^i}[f_S] + \frac{1}{n-1} V(f_S(x_i), y_i) \le I_{S^i}[f_{S^i}] + \frac{1}{n-1} V(f_{S^i}(x_i), y_i)$$

$$I_{S^i}[f_S] \le I_{S^i}[f_{S^i}] + \frac{\beta}{n-1}.$$

Combining the above derivations, we obtain

$$I_{S^i}[f_{S^i}] + \frac{2c(b)}{n-1} \le I_{S^i}[f_{S^i}] + \frac{\beta}{n-1}$$

$$2c(b) \le \beta.$$

If $\beta$ is a bound on the difference at the left out point, then at every other training point, the difference in loss is less than $b$, where $b$ satisfies $2c(b) = \beta$. As $\beta \to 0$, $b \to 0$ as well, finishing the proof. Note that we do not provide a rate.


• Remark. The assumption of realizability (this setup is also called proper learning) is a strong assumption. Throughout this work, we have essentially used compactness (equicontinuity) to enforce a condition that our space will not contain functions that approach the minimizer in risk without also approaching it in the sup norm. If we assume realizability, then we can often get this property without needing compactness. For instance, consider the simple class of functions (from $[0, 1]$ to $[0, 1]$) where $f_i(x) = \min(ix, 1)$. This is a uGC but non-compact class of functions. If we choose the (non-realizable) target function to be $f^*(x) = 1$, we find that we don't get $(\beta, \delta)$-hypothesis stability. However, if we require that the target actually be some $f_i$, we will recover $(\beta, \delta)$-hypothesis stability; essentially, the slope of the target $f_i$ acts as an equicontinuity constant: those $f_j$ that slope too much more rapidly will not have empirical risk equivalent to $f_i$ for sufficiently large samples. While it is not quite true in general that realizability plus uGC $\Rightarrow$ $(\beta, \delta)$-hypothesis stability (in the above example, if we add $f(x) = 1$ to $\mathcal{H}$ we lose our $(\beta, \delta)$-hypothesis stability), we conjecture that a slightly weaker statement holds: if $\mathcal{H}$ is uGC and the target function is in $\mathcal{H}$, then we will get $(\beta, \delta)$-hypothesis stability over all of $Z$ except possibly a set of measure 0.
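The class $f_i(x) = \min(ix, 1)$ with target $f^*(x) = 1$ illustrates functions that approach the target in risk without approaching it in the sup norm. The sketch below assumes a uniform input measure on $[0, 1]$ and noiseless labels (both assumptions for illustration only); under them the expected square-loss risk of $f_i$ is $\int_0^{1/i} (1 - ix)^2 dx = 1/(3i)$, while $\sup_x |f_i(x) - 1| = 1$ for every $i$. The function names are hypothetical.

```python
def f(i, x):
    """Member f_i(x) = min(i * x, 1) of the example class."""
    return min(i * x, 1.0)

def expected_risk(i, grid_size=20_000):
    """Midpoint-rule estimate of the risk of f_i against the target 1.

    Assumes the uniform measure on [0, 1] and the square loss.
    """
    h = 1.0 / grid_size
    total = 0.0
    for j in range(grid_size):
        x = (j + 0.5) * h
        total += (f(i, x) - 1.0) ** 2
    return total * h

def sup_distance_from_target(i):
    """sup over [0, 1] of |f_i(x) - 1|, attained at x = 0 because f_i is nondecreasing."""
    return abs(f(i, 0.0) - 1.0)
```

As `i` grows, `expected_risk(i)` shrinks like `1 / (3 * i)` while `sup_distance_from_target(i)` stays equal to 1.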

Convex  hull 

We state a simple application of the previous results, together with an obvious property of convex hulls.

Lemma The convex hull $\mathcal{H}_c$ of $\mathcal{H}$ is compact if and only if $\mathcal{H}$ is compact.

Theorem $(\beta, \delta)$-hypothesis stability of ERM on $\mathcal{H}_c$ for any measure with appropriate regularity conditions is necessary and sufficient for compactness of a uGC $\mathcal{H}$.

Figure 1: An overall view of some of the properties discussed in this paper and their relations. Arrows are to be read as implies: for example a => b means a implies b. For ERM the classical result is that generalization and consistency imply that $\mathcal{H}$ is uGC and are implied by it. The other relations represent some of the new results of this paper. In the diagram we can substitute the combination of $E_{loo}$ and $EE_{loo}$ stability with $Eloo_{err}$ stability. Notice that for ERM generalization implies consistency. As an example of a non-ERM algorithm, Tikhonov regularization implies uniform hypothesis stability, which implies both $CVEEE_{loo}$ (i.e. $CV_{loo}$, $E_{loo}$ and $EE_{loo}$ stability) and LOO stability (i.e. $CV_{loo}$ and $Eloo_{err}$). Note that Ivanov regularization in a RKHS is an example of ERM, whereas Tikhonov regularization is not ERM.